```r
knitr::opts_chunk$set(warning = FALSE)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


### Needed libraries


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxubGlicmFyeShkcGx5cilcbmxpYnJhcnkoY291bnRyeWNvZGUpXG5saWJyYXJ5KG91dGxpZXJzKVxubGlicmFyeShjYXJldClcbmxpYnJhcnkoY2x1c3RlcilcbmxpYnJhcnkoZmFjdG9leHRyYSlcbmxpYnJhcnkoTmJDbHVzdClcbmxpYnJhcnkoXFxETXdSXFwpXG5saWJyYXJ5KFxcUldla2FcXClcbmxpYnJhcnkoXFxDNTBcXClcbmxpYnJhcnkoXFxycGFydFxcKVxubGlicmFyeShcXHRoZW1pc1xcKVxubGlicmFyeShyYXR0bGUpXG5saWJyYXJ5KHJwYXJ0LnBsb3QpXG5saWJyYXJ5KFJDb2xvckJyZXdlcilcbmBgYFxuYGBgIn0= -->

```r
library(dplyr)
library(countrycode)
library(outliers)
library(caret)
library(cluster)
library(factoextra)
library(NbClust)
library("DMwR")
library("RWeka")
library("C50")
library("rpart")
library("themis")
library(rattle)
library(rpart.plot)
library(RColorBrewer)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


# Phase 1

### Problem statement

Predicting cybersecurity employees' salaries and grouping employees by shared characteristics, using the 11 attributes of the dataset:

1.  work_year
2.  experience_level
3.  employment_type
4.  job_title
5.  salary
6.  salary_currency
7.  salary_in_usd
8.  employee_residence
9.  remote_ratio
10. company_location
11. company_size

### Problem description

We live in the "information age", or rather the "data age": almost everything around us revolves around data. Data has become one of the most valuable assets a person or an organisation can hold, and losing it can cause significant damage, so most attacks today target data. To guard against such damage, organisations have recognised the importance of protecting their digital assets and have started hiring cybersecurity specialists. This has made cybersecurity a popular field of study and has produced a large pool of professionals with varied skills and experience levels. As a result, organisations may find it difficult to decide on a salary for a job candidate based solely on the CV. In addition, because attacks evolve rapidly, organisations will need to hire more employees in the future to defend against them, yet future payroll is hard to predict, which may hinder some of their plans. Another issue arises when decision makers are not fully aware of the different groups of employees and their different needs. That lack of awareness gives competing organisations a chance to attract those employees by offering better salaries and privileges that match their needs.

### Data mining task

Predicting the salary category of cybersecurity employees (Very Low, Low, Medium, High, Very High) using classification, and describing the characteristics and behavior of the data by grouping it with clustering methods.

### Goal

Given the problems discussed above, and in order to better understand this field, we decided to analyse a dataset of 1247 cybersecurity employees containing information such as salary, job title, and experience level. Analysing this dataset can provide insightful predictions of a cybersecurity employee's salary range and, by grouping the data, a description of how the cybersecurity job market behaves. This can help in:

-   Segmenting the market
-   Identifying trends
-   Specifying common characteristics among cybersecurity employees
-   Identifying the main cybersecurity employee groups to better understand their needs
-   Making better decisions
-   Making the recruitment and hiring process easier and more efficient
-   Predicting the future payroll
-   Increasing loyalty
-   Increasing the satisfaction rate
-   Achieving fairness

## Data

## Source of data:

<https://www.kaggle.com/datasets/deepcontractor/cyber-security-salaries>

### Reading and viewing dataset


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZGF0YXNldD0gcmVhZC5jc3YodXJsKFxcaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL1NhcmFoQWxoaW5kaS9ETV9wcm9qZWN0L21haW4vRGF0YSUyMFNldC9zYWxhcmllc19jeWJlci5jc3ZcXCksIGhlYWRlcj1UUlVFKVxuVmlldyhkYXRhc2V0KVxuXG5gYGBcbmBgYCJ9 -->

```r
# Read the dataset directly from the project's GitHub repository
dataset = read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/salaries_cyber.csv"), header = TRUE)
View(dataset)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


### Original dataset

We keep a copy of the original dataset before preprocessing, so that it can be used later if needed.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxub3JpZ2luYWxEYXRhc2V0PSBkYXRhc2V0XG5gYGBcbmBgYCJ9 -->

```r
originalDataset = dataset
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


## General information about the dataset:

No. of attributes: 11\
Type of attributes: Ordinal, Nominal, and Numeric\
No. of objects: 1247\
Class label: salary_in_usd


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxubmNvbChkYXRhc2V0KVxubnJvdyhkYXRhc2V0KVxubmFtZXMoZGF0YXNldClcbnN0cihkYXRhc2V0KVxuYGBgXG5gYGAifQ== -->

```r
ncol(dataset)   # number of attributes
nrow(dataset)   # number of objects
names(dataset)  # attribute names
str(dataset)    # type of each attribute
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


### Attributes' description table

+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| **Attribute Name** | **Description**                                             | **Data Type** | **Possible values**                                       |
+====================+=============================================================+===============+===========================================================+
| work_year          | The year in which salary was recorded                       | Numerical     | 2020 to 2022                                              |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| experience_level   | Expertise level of the employee                             | Ordinal       | EN "Entry level"\                                         |
|                    |                                                             |               | MI "Mid level"\                                           |
|                    |                                                             |               | SE "Senior level"\                                        |
|                    |                                                             |               | EX "Executive level"                                      |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| employment_type    | The nature or category of employee's engagement in the job  | Nominal       | PT "Part time"\                                           |
|                    |                                                             |               | FT "Full time"\                                           |
|                    |                                                             |               | CT "Contract"\                                            |
|                    |                                                             |               | FL "Freelancer"                                           |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| job_title          | The role worked in during the year                          | Nominal       | Different titles.                                         |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like Security Analyst, security researcher                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| salary             | The total gross salary amount paid                          | Numerical     | 1740-50001566                                             |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| salary_currency    | The currency of the salary paid to the employee             | Nominal       | Different currencies according to ISO 4217 currency code. |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like EUR, CAD                                             |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| salary_in_usd      | The salary paid in United States dollars                    | Numerical     | 2000 to 365596.40                                         |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| employee_residence | Employee's primary country of residence                     | Nominal       | Different countries.                                      |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like US,AE                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| remote_ratio       | Percentage of online work by employee in the specified year | Numerical     | 0 "No remote work"\                                       |
|                    |                                                             |               | 50 "Partially remote"\                                    |
|                    |                                                             |               | 100 "Fully remote"                                        |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| company_location   | The country of the employer's main office                   | Nominal       | Different countries.                                      |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like BR,BW                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| company_size       | How big/small is the company                                | Ordinal       | S , M or L                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+

# Phase 2

### Sample of 20 employees from the dataset:

We draw the sample with sample_n(table, size) after calling set.seed(), so that the same sample is reproduced on every run.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMzApXG5zYW1wbGU9c2FtcGxlX24oZGF0YXNldCwyMClcbnByaW50KHNhbXBsZSlcbmBgYFxuYGBgIn0= -->

```r
set.seed(30)                      # fix the seed so the sample is reproducible
sample = sample_n(dataset, 20)    # 20 random rows
print(sample)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


### Show the missing values:

is.na() returns FALSE for every value that is present and TRUE for every value that is missing, and sum(is.na()) counts the missing values. Our dataset has no missing values.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuaXMubmEoZGF0YXNldClcbnN1bShpcy5uYShkYXRhc2V0KSlcbmBgYFxuYGBgIn0= -->

```r
is.na(dataset)        # TRUE marks a missing value
sum(is.na(dataset))   # total number of missing values
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->
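
Printing the full is.na() matrix is hard to read for more than a thousand rows. A more compact check counts the missing values per column. The sketch below uses a small made-up data frame (`toy` and its values are assumptions, not part of the real dataset), so it only illustrates the idea:

```r
# Toy data frame standing in for the real dataset (values are made up).
toy <- data.frame(
  salary_in_usd = c(50000, NA, 120000),
  remote_ratio  = c(0, 50, NA),
  company_size  = c("S", "M", NA)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE if at least one value is missing anywhere in the data frame.
has_missing <- anyNA(toy)
print(has_missing)
```

Applied to the real data, the same two calls would take `dataset` instead of `toy`.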


<!-- rnb-text-begin -->


### Show the Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for each numeric column

The summary statistics give insight into how each numeric attribute is distributed. We can conclude the following:

In work_year: the data spans 2020 to 2022, with most records falling in 2021 and 2022, as indicated by the median and mean both being centered around 2021.

In salary: salaries vary widely, with a minimum of 1,740 and a maximum of 500 million (each in the record's own currency). The median is 120,000, but the mean is notably higher at 560,852, which is likely due to extreme values and strong skewness.

In salary_in_usd: the median is \$110,000 and the mean is \$120,278. The mean exceeding the median points to a mild right skew in the salaries.

In remote_ratio: the percentage of remote work ranges from 0% to 100%, with the median and 3rd quartile both at 100% and a mean of 71.49%. At least half of the employees therefore work fully remotely, while the lower mean shows that a notable share work partly or fully on-site.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc3VtbWFyeShkYXRhc2V0JHdvcmtfeWVhcilcbnN1bW1hcnkoZGF0YXNldCRzYWxhcnkpXG5zdW1tYXJ5KGRhdGFzZXQkc2FsYXJ5X2luX3VzZClcbnN1bW1hcnkoZGF0YXNldCRyZW1vdGVfcmF0aW8pXG5gYGBcbmBgYCJ9 -->

```r
summary(dataset$work_year)
summary(dataset$salary)
summary(dataset$salary_in_usd)
summary(dataset$remote_ratio)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


### Show the variance of each numeric column

Variance measures the spread, or dispersion, of the values in each column. A higher variance means the values are more spread out from the mean; in our dataset the attribute with the highest variance is salary. A lower variance means the values lie closer to the mean; in our dataset that attribute is work_year.

The variance results reveal that:

-   work years are fairly consistent
-   salaries show notable variability and possible outliers
-   salaries in USD vary less than the raw salaries
-   the remote work ratio has moderate variability


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxudmFyKGRhdGFzZXQkd29ya195ZWFyKVxudmFyKGRhdGFzZXQkc2FsYXJ5KVxudmFyKGRhdGFzZXQkc2FsYXJ5X2luX3VzZClcbnZhcihkYXRhc2V0JHJlbW90ZV9yYXRpbylcbmBgYFxuYGBgIn0= -->

```r
var(dataset$work_year)
var(dataset$salary)
var(dataset$salary_in_usd)
var(dataset$remote_ratio)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->
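
Raw variances depend on the unit of each attribute, so comparing them across columns can be misleading. One common way to compare spread across scales is the coefficient of variation (standard deviation divided by the mean). The sketch below is an illustration on made-up values; `toy` and `coef_var` are hypothetical names, and the measure is only meaningful for attributes with a true zero, such as salaries and percentages:

```r
# Toy values (made up) for two attributes on very different scales.
toy <- data.frame(
  remote_ratio  = c(0, 50, 100, 100, 100),
  salary_in_usd = c(60000, 85000, 110000, 150000, 240000)
)

# Coefficient of variation: spread relative to the mean, unit-free.
coef_var <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

raw_var <- sapply(toy, var)
cv      <- sapply(toy, coef_var)

print(raw_var)
print(round(cv, 3))
```

In this toy example the salary column has by far the larger variance, yet the remote ratio is the more dispersed attribute relative to its mean.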


<!-- rnb-text-begin -->


### Visualization of the relationship between some pairs of attributes:

Here we use a boxplot to compare the distribution of salary_in_usd across experience_level. Salaries vary with the level of experience: the two are positively associated.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuYm94cGxvdChzYWxhcnlfaW5fdXNkIH4gZXhwZXJpZW5jZV9sZXZlbCwgZGF0YSA9IGRhdGFzZXQgLCB5YXh0PVxcblxcKVxubGFiZWxzPC0gcHJldHR5KGRhdGFzZXQkc2FsYXJ5X2luX3VzZClcbmxhYmVsczwtIHNhcHBseShsYWJlbHMsIGZ1bmN0aW9uKHgpIGZvcm1hdCh4LCBzY2llbnRpZmljID0gRkFMU0UpKVxuYXhpcyhzaWRlID0gMiwgYXQ9cHJldHR5KGRhdGFzZXQkc2FsYXJ5X2luX3VzZCksIGxhYmVscyA9IGxhYmVscyApXG5vcHRpb25zKHNjaXBlbiA9IDk5OSlcbmBgYFxuYGBgIn0= -->

```r
# Draw the boxplot without the default y axis, then add plain-number tick labels
boxplot(salary_in_usd ~ experience_level, data = dataset, yaxt = "n")
labels <- pretty(dataset$salary_in_usd)
labels <- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at = pretty(dataset$salary_in_usd), labels = labels)
options(scipen = 999)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Here we use a boxplot to compare the distribution of salary_in_usd across work_year. Salaries in 2021 were close to each other, whereas in 2022 the gap between them grew.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuYm94cGxvdChzYWxhcnlfaW5fdXNkIH4gd29ya195ZWFyLCBkYXRhID0gZGF0YXNldCAsIHlheHQ9XFxuXFwpXG5sYWJlbHM8LSBwcmV0dHkoZGF0YXNldCRzYWxhcnlfaW5fdXNkKVxubGFiZWxzPC0gc2FwcGx5KGxhYmVscywgZnVuY3Rpb24oeCkgZm9ybWF0KHgsIHNjaWVudGlmaWMgPSBGQUxTRSkpXG5heGlzKHNpZGUgPSAyLCBhdD1wcmV0dHkoZGF0YXNldCRzYWxhcnlfaW5fdXNkKSwgbGFiZWxzID0gbGFiZWxzIClcbm9wdGlvbnMoc2NpcGVuID0gOTk5KVxuYGBgXG5gYGAifQ== -->

```r
boxplot(salary_in_usd ~ work_year, data = dataset, yaxt = "n")
labels <- pretty(dataset$salary_in_usd)
labels <- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at = pretty(dataset$salary_in_usd), labels = labels)
options(scipen = 999)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Here we use a boxplot to compare the distribution of salary_in_usd across employment_type. Full-time (FT) positions pay higher salaries than the other categories.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuYm94cGxvdChzYWxhcnlfaW5fdXNkIH4gZW1wbG95bWVudF90eXBlLCBkYXRhID0gZGF0YXNldCAsIHlheHQ9XFxuXFwpXG5sYWJlbHM8LSBwcmV0dHkoZGF0YXNldCRzYWxhcnlfaW5fdXNkKVxubGFiZWxzPC0gc2FwcGx5KGxhYmVscywgZnVuY3Rpb24oeCkgZm9ybWF0KHgsIHNjaWVudGlmaWMgPSBGQUxTRSkpXG5heGlzKHNpZGUgPSAyLCBhdD1wcmV0dHkoZGF0YXNldCRzYWxhcnlfaW5fdXNkKSwgbGFiZWxzID0gbGFiZWxzIClcbm9wdGlvbnMoc2NpcGVuID0gOTk5KVxuYGBgXG5gYGAifQ== -->

```r
boxplot(salary_in_usd ~ employment_type, data = dataset, yaxt = "n")
labels <- pretty(dataset$salary_in_usd)
labels <- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at = pretty(dataset$salary_in_usd), labels = labels)
options(scipen = 999)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Here we use a boxplot to compare the distribution of salary_in_usd across company_size. The larger the company, the higher the salary.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuYm94cGxvdChzYWxhcnlfaW5fdXNkIH4gY29tcGFueV9zaXplLCBkYXRhID0gZGF0YXNldCAsIHlheHQ9XFxuXFwpXG5sYWJlbHM8LSBwcmV0dHkoZGF0YXNldCRzYWxhcnlfaW5fdXNkKVxubGFiZWxzPC0gc2FwcGx5KGxhYmVscywgZnVuY3Rpb24oeCkgZm9ybWF0KHgsIHNjaWVudGlmaWMgPSBGQUxTRSkpXG5heGlzKHNpZGUgPSAyLCBhdD1wcmV0dHkoZGF0YXNldCRzYWxhcnlfaW5fdXNkKSwgbGFiZWxzID0gbGFiZWxzIClcbm9wdGlvbnMoc2NpcGVuID0gOTk5KSBcbmBgYFxuYGBgIn0= -->

```r
boxplot(salary_in_usd ~ company_size, data = dataset, yaxt = "n")
labels <- pretty(dataset$salary_in_usd)
labels <- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at = pretty(dataset$salary_in_usd), labels = labels)
options(scipen = 999)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->
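
The four boxplots above repeat the same plotting steps with a different grouping attribute. As a possible simplification, the steps could be wrapped in a small helper. The function name `salary_boxplot` and the toy data below are assumptions made for illustration only:

```r
# Hypothetical helper: boxplot of a numeric column by a grouping column,
# with y-axis labels printed as plain numbers instead of scientific notation.
salary_boxplot <- function(data, group_col, value_col = "salary_in_usd") {
  # Build the formula value_col ~ group_col from the column names.
  plot_formula <- reformulate(group_col, response = value_col)
  # Suppress the default y axis so it can be redrawn with custom labels.
  box_stats <- boxplot(plot_formula, data = data, yaxt = "n")
  ticks <- pretty(data[[value_col]])
  axis(side = 2, at = ticks,
       labels = format(ticks, scientific = FALSE, trim = TRUE))
  invisible(box_stats)
}

# Toy data (made up) to show the call.
toy <- data.frame(
  experience_level = c("EN", "EN", "MI", "MI", "SE", "SE"),
  salary_in_usd    = c(40000, 55000, 80000, 95000, 130000, 160000)
)
box_stats <- salary_boxplot(toy, "experience_level")
```

With the real data, each plot would then be a single call such as `salary_boxplot(dataset, "company_size")`.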


<!-- rnb-text-begin -->


## Data preprocessing

## Data Reduction

### Dimensionality Reduction

The "salary" column carries the same information as "salary_in_usd"; the two differ only by currency exchange, and we would eventually have to convert all values in "salary" to one common currency to work with them properly. To confirm that the two columns are redundant, we use the latest exchange rates between USD and each currency.

We start by creating a temporary column named "converted_salary". It holds salary_in_usd converted back into each record's own currency with the exchange rate, so that it can be compared with the "salary" column.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuY29udmVydGVkRGF0YXNldD1kYXRhc2V0XG5cblxuY29udmVydGVkRGF0YXNldCRleGNoYW5nZV9yYXRlID0gZmFjdG9yKGNvbnZlcnRlZERhdGFzZXQkc2FsYXJ5X2N1cnJlbmN5LCBsZXZlbHM9YyhcXFVTRFxcLFxcQlJMXFwsXFxHQlBcXCxcXEVVUlxcLFxcSU5SXFwsXFxDQURcXCxcXENIRlxcLFxcREtLXFwsXFxTR0RcXCxcXEFVRFxcLFxcU0VLXFwsXFxNWE5cXCxcXElMU1xcLFxcUExOXFwsXFxOT0tcXCxcXElEUlxcLFxcTlpEXFwsXFxIVUZcXCxcXFpBUlxcLFxcVFdEXFwsXFxSVUJcXCksIGxhYmVscz1jKDEvMSwxLzAuMjAsMS8xLjIyLDEvMS4wNiwxLzAuMDEyLDEvMC43NCwxLzEuMTAsMS8wLjE0LDEvMC43MywxLzAuNjQsMS8wLjA5MCwxLzAuMDU3LDEvMC4yNiwxLzAuMjMsMS8wLjA5MywxLzAuMDAwMDY1LDEvMC42MCwxLzAuMDAyNywxLzAuMDUzLDEvMC4wMzEsMS8wLjAxMCkpXG5jb252ZXJ0ZWREYXRhc2V0JGV4Y2hhbmdlX3JhdGUgPSBhcy5udW1lcmljKGFzLmNoYXJhY3Rlcihjb252ZXJ0ZWREYXRhc2V0JGV4Y2hhbmdlX3JhdGUpKVxuY29udmVydGVkRGF0YXNldCRjb252ZXJ0ZWRfc2FsYXJ5ID0gY29udmVydGVkRGF0YXNldCRzYWxhcnlfaW5fdXNkKmNvbnZlcnRlZERhdGFzZXQkZXhjaGFuZ2VfcmF0ZVxuXG5cblxuc2V0LnNlZWQoMSlcbnNhbGFyeV9zYW1wbGUgPC0gc2FtcGxlX24oY29udmVydGVkRGF0YXNldFssYyhcXHNhbGFyeVxcLFxcY29udmVydGVkX3NhbGFyeVxcKV0sMTApXG5cbnByaW50KHNhbGFyeV9zYW1wbGUpXG5gYGBcbmBgYCJ9 -->

```r
convertedDataset = dataset

# Map each currency code to the number of currency units per 1 USD
convertedDataset$exchange_rate = factor(convertedDataset$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1/1,1/0.20,1/1.22,1/1.06,1/0.012,1/0.74,1/1.10,1/0.14,1/0.73,1/0.64,1/0.090,1/0.057,1/0.26,1/0.23,1/0.093,1/0.000065,1/0.60,1/0.0027,1/0.053,1/0.031,1/0.010))
convertedDataset$exchange_rate = as.numeric(as.character(convertedDataset$exchange_rate))

# Convert the USD salary back into the original currency
convertedDataset$converted_salary = convertedDataset$salary_in_usd * convertedDataset$exchange_rate

# Compare the recorded and the reconstructed salary on 10 random rows
set.seed(1)
salary_sample <- sample_n(convertedDataset[, c("salary", "converted_salary")], 10)

print(salary_sample)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


As the sample shows, the two columns are almost identical. The correlation coefficient confirms this as well.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuY29ycmVsYXRpb24gPC0gY29yKGNvbnZlcnRlZERhdGFzZXQkc2FsYXJ5ICwgY29udmVydGVkRGF0YXNldCRjb252ZXJ0ZWRfc2FsYXJ5KVxucHJpbnQoY29ycmVsYXRpb24pXG5gYGBcbmBgYCJ9 -->

```r
correlation <- cor(convertedDataset$salary, convertedDataset$converted_salary)
print(correlation)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The correlation is very high, but it does not reach 100%, possibly because of rounding in the calculations and small changes in the exchange rates over time.
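
A correlation coefficient can be dominated by a few very large values, so a per-row relative difference may be a useful complementary check. The sketch below is only an illustration: the rates in `units_per_usd` and the rows in `toy` are made up, and the named-vector lookup is an alternative to the factor-based mapping, not the method used in this notebook.

```r
# Hypothetical rates: units of local currency per 1 USD (made-up values).
units_per_usd <- c(USD = 1, EUR = 0.95, GBP = 0.80)

# Toy rows (made up): recorded salary, its currency, and the USD value.
toy <- data.frame(
  salary          = c(100000, 57000, 41000),
  salary_currency = c("USD", "EUR", "GBP"),
  salary_in_usd   = c(100000, 60000, 50000)
)

# Look up each row's rate by currency code and convert USD back to local currency.
rate <- unname(units_per_usd[as.character(toy$salary_currency)])
toy$converted_salary <- toy$salary_in_usd * rate

# Relative difference between the recorded and the reconstructed salary.
toy$rel_diff <- abs(toy$salary - toy$converted_salary) / toy$salary
print(toy)
```

Rows with a small relative difference support the claim that the two columns are redundant; rows with a large one would point to an outdated rate or a data entry problem.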

To make the mining process more efficient and improve its quality, we decided to remove the "salary" column.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZGF0YXNldDwtZGF0YXNldFssLWMoNSldXG5gYGBcbmBgYCJ9 -->

```r
# Drop the redundant "salary" column (the 5th column)
dataset <- dataset[, -c(5)]
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


### Find the outliers and remove them:

We show outliers with boxplots and then remove them, to minimize noise and to get better analytical results when applying data mining techniques.

First we show the outliers of the salary_in_usd attribute. There are many outliers with exceptionally high values, so we remove them.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuYm94cGxvdChkYXRhc2V0JHNhbGFyeV9pbl91c2QpXG5cblxuXG5PdXRTYWxhcnkgPSBvdXRsaWVyKGRhdGFzZXQkc2FsYXJ5X2luX3VzZCwgbG9naWNhbCA9VFJVRSlcbkZpbmRfb3V0bGllciA9IHdoaWNoKE91dFNhbGFyeSA9PVRSVUUsIGFyci5pbmQgPSBUUlVFKVxuZGF0YXNldD0gZGF0YXNldFstRmluZF9vdXRsaWVyLF1cblxuYGBgXG5gYGAifQ== -->

```r
boxplot(dataset$salary_in_usd)

# outlier() with logical = TRUE marks the value that lies furthest from the mean
OutSalary = outlier(dataset$salary_in_usd, logical = TRUE)
Find_outlier = which(OutSalary == TRUE, arr.ind = TRUE)
dataset = dataset[-Find_outlier, ]
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->
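
Note that `outlier()` from the outliers package looks for the value with the largest difference from the sample mean, so a single call typically flags only the most extreme observation. If the aim is to drop every point that a boxplot draws beyond its whiskers, the 1.5 × IQR rule is one common alternative. The sketch below applies it to made-up salaries; it illustrates a swapped-in technique and is not the method used in this notebook:

```r
# Toy salaries (made up) with two exceptionally high values.
toy_salary <- c(40000, 55000, 60000, 72000, 80000,
                95000, 110000, 400000, 900000)

# First and third quartiles and the interquartile range.
q   <- unname(quantile(toy_salary, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Boxplot whisker limits: 1.5 * IQR beyond the quartiles.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Flag and drop every value outside the limits.
is_outlier <- toy_salary < lower | toy_salary > upper
cleaned    <- toy_salary[!is_outlier]

print(toy_salary[is_outlier])
print(cleaned)
```

For a data frame, the same logical vector could be used to keep rows, for example `dataset[!is_outlier, ]`.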


<!-- rnb-text-begin -->


Next we show the outliers of the remote_ratio attribute. There are no outliers in remote_ratio, so the removal step is not needed.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuYm94cGxvdChkYXRhc2V0JHJlbW90ZV9yYXRpbylcblxuYGBgXG5gYGAifQ== -->

```r
boxplot(dataset$remote_ratio)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Finally we show the outliers of the work_year attribute. There are no outliers in work_year, so the removal step is not needed here either.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuYm94cGxvdChkYXRhc2V0JHdvcmtfeWVhcilcblxuYGBgXG5gYGAifQ== -->

```r
boxplot(dataset$work_year)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


### Concept hierarchy generation for nominal data

The columns "company_location" and "employee_residence" hold the country of the company and of the employee, respectively. These attributes can be generalized to a higher-level concept, the region, which helps us understand and analyze the dataset better and improves algorithm performance.

We use the 7 regions as defined in the World Bank Development Indicators. These regions are:

1.  East Asia and Pacific: This region includes countries like China, Australia, Indonesia, Thailand, etc.

2.  Europe and Central Asia: This region includes countries like Germany, UK, Russia, Turkey, etc.

3.  Latin America & Caribbean: This region includes countries like Brazil, Mexico, Argentina, Cuba, etc.

4.  Middle East and North Africa: This region includes countries like Saudi Arabia, Egypt, Iran, Iraq, etc.

5.  North America: This is predominantly United States and Canada.

6.  South Asia: This region includes countries like India, Pakistan, Bangladesh, Sri Lanka, etc.

7.  Sub-Saharan Africa: This region includes countries like Nigeria, South Africa, Ethiopia, Kenya, etc.

Note: UM (the United States Minor Outlying Islands) and AQ (Antarctica) don't belong to any of these regions, so they are kept as they are.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG5cbnVtPXdoaWNoKGRhdGFzZXQkY29tcGFueV9sb2NhdGlvbj09XFxVTVxcKVxuYXE9d2hpY2goZGF0YXNldCRjb21wYW55X2xvY2F0aW9uPT1cXEFRXFwpXG5cblxuZGF0YXNldCRjb21wYW55X2xvY2F0aW9uIDwtIGNvdW50cnljb2RlKGRhdGFzZXQkY29tcGFueV9sb2NhdGlvbiwgXFxpc28yY1xcLCBcXHJlZ2lvblxcKVxuZGF0YXNldCRlbXBsb3llZV9yZXNpZGVuY2UgPC0gY291bnRyeWNvZGUoZGF0YXNldCRlbXBsb3llZV9yZXNpZGVuY2UsIFxcaXNvMmNcXCwgXFxyZWdpb25cXClcblxuZGF0YXNldFt1bSxcXGNvbXBhbnlfbG9jYXRpb25cXF09XFxVTVxcXG5kYXRhc2V0W2FxLFxcY29tcGFueV9sb2NhdGlvblxcXT1cXEFRXFxcblxuYGBgXG5gYGAifQ== -->

```r
# Remember the rows whose country code has no region (UM, AQ)
um = which(dataset$company_location == "UM")
aq = which(dataset$company_location == "AQ")

# Map ISO 2-letter country codes to regions
dataset$company_location <- countrycode(dataset$company_location, "iso2c", "region")
dataset$employee_residence <- countrycode(dataset$employee_residence, "iso2c", "region")

# Restore the two codes that have no region
dataset[um, "company_location"] = "UM"
dataset[aq, "company_location"] = "AQ"
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->
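
The save-and-restore step for UM and AQ can also be expressed through the `custom_match` argument of `countrycode()`, which overrides the default result for the listed origin codes. The sketch below runs on a made-up vector of codes (`codes` is an assumption) and is meant as an illustration rather than a replacement for the chunk above:

```r
library(countrycode)

# Made-up sample of ISO 2-letter codes, including the two special cases.
codes <- c("US", "DE", "BR", "UM", "AQ")

# custom_match supplies the destination value for UM and AQ directly,
# so they are kept as they are instead of depending on the default mapping.
regions <- countrycode(
  codes,
  origin       = "iso2c",
  destination  = "region",
  custom_match = c(UM = "UM", AQ = "AQ")
)
print(regions)
```

Because the override is part of the call, it applies in the same way to any column it is used on, for example both company_location and employee_residence.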


<!-- rnb-text-begin -->


Concept hierarchy generation can be applied to "job_title" as well, to improve interpretability and scalability. Many job titles describe essentially the same job under different names, so we combine them into higher-level job titles such as Architect, Analyst, and Engineer.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuIyMgQ3JlYXRlIHRoZSBjYXRlZ29yaWVzIGJhc2VkIG9uIGpvYiByYW5rIFxuZGF0YXNldCRqb2JfdGl0bGUgPC0gaWZlbHNlKGdyZXBsKFxcQW5hbHlzdFxcLCBkYXRhc2V0JGpvYl90aXRsZSksIFxcQW5hbHlzdFxcLFxuICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBpZmVsc2UoZ3JlcGwoXFxBcmNoaXRlY3RcXCwgZGF0YXNldCRqb2JfdGl0bGUpLCBcXEFyY2hpdGVjdFxcLFxuICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgaWZlbHNlKGdyZXBsKFxcRW5naW5lZXJcXCwgZGF0YXNldCRqb2JfdGl0bGUpLCBcXEVuZ2luZWVyXFwsXG4gICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgaWZlbHNlKGdyZXBsKFxcTWFuYWdlcnxPZmZpY2VyfERpcmVjdG9yfExlYWRlclxcLCBkYXRhc2V0JGpvYl90aXRsZSksIFxcTGVhZGVyc2hpcFxcLFxuICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBpZmVsc2UoZ3JlcGwoXFxDb25zdWx0YW50fFNwZWNpYWxpc3RcXCwgZGF0YXNldCRqb2JfdGl0bGUpLCBcXENvbnN1bHRhbnQvU3BlY2lhbGlzdFxcLFxuICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgaWZlbHNlKGdyZXBsKFxcQ3liZXJcXCwgZGF0YXNldCRqb2JfdGl0bGUpLCBcXEN5YmVyIFNlY3VyaXR5XFwsXG4gICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgXFxPdGhlcnNcXCkpKSkpKVxuXG5gYGBcbmBgYCJ9 -->

```r
# Group job titles into higher-level categories; the first matching pattern wins
dataset$job_title <- ifelse(grepl("Analyst", dataset$job_title), "Analyst",
                                ifelse(grepl("Architect", dataset$job_title), "Architect",
                                       ifelse(grepl("Engineer", dataset$job_title), "Engineer",
                                              ifelse(grepl("Manager|Officer|Director|Leader", dataset$job_title), "Leadership",
                                                     ifelse(grepl("Consultant|Specialist", dataset$job_title), "Consultant/Specialist",
                                                            ifelse(grepl("Cyber", dataset$job_title), "Cyber Security",
                                                                   "Others"))))))
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->
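
Nested `ifelse()` calls become hard to read and extend as the number of categories grows. One possible alternative keeps the patterns in an ordered lookup table and applies them in a loop. In the sketch below, `title_groups`, `group_job_title`, and the sample titles are hypothetical names and values chosen for illustration; the patterns and categories are the ones used above.

```r
# Ordered pattern -> category table; earlier entries take precedence,
# mirroring the order of the nested ifelse() calls.
title_groups <- c(
  "Analyst"                         = "Analyst",
  "Architect"                       = "Architect",
  "Engineer"                        = "Engineer",
  "Manager|Officer|Director|Leader" = "Leadership",
  "Consultant|Specialist"           = "Consultant/Specialist",
  "Cyber"                           = "Cyber Security"
)

group_job_title <- function(titles, groups = title_groups, default = "Others") {
  result   <- rep(default, length(titles))
  assigned <- rep(FALSE, length(titles))
  for (pattern in names(groups)) {
    # Only label titles that no earlier pattern has claimed.
    hit <- !assigned & grepl(pattern, titles)
    result[hit] <- groups[[pattern]]
    assigned <- assigned | hit
  }
  result
}

# Made-up job titles to show the behavior, including precedence.
toy_titles <- c("Security Analyst", "Cloud Security Architect",
                "Security Engineering Manager", "Cyber Threat Hunter",
                "Penetration Tester")
grouped <- group_job_title(toy_titles)
print(grouped)
```

The third title matches both "Engineer" and "Manager"; because "Engineer" comes first in the table, it is grouped as Engineer, just as it would be by the nested version.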


<!-- rnb-text-begin -->


## Encoding categorical data

We encode the character columns as factors, because most machine learning algorithms are designed to work with factor data rather than character data. Encoding also improves performance and the interpretability of the data.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZGF0YXNldCRqb2JfdGl0bGUgIDwtIGZhY3RvcihkYXRhc2V0JGpvYl90aXRsZSlcblxuZGF0YXNldCRleHBlcmllbmNlX2xldmVsID0gZmFjdG9yKGRhdGFzZXQkZXhwZXJpZW5jZV9sZXZlbCwgbGV2ZWxzPWMoXFxFTlxcLCBcXE1JXFwsIFxcU0VcXCwgXFxFWFxcKSwgbGFiZWxzPWMoMSwyLDMsNCkpXG5cbmRhdGFzZXQkZW1wbG95bWVudF90eXBlICA8LSBmYWN0b3IoZGF0YXNldCRlbXBsb3ltZW50X3R5cGUpXG5cbmRhdGFzZXQkZW1wbG95ZWVfcmVzaWRlbmNlICA8LSBmYWN0b3IoZGF0YXNldCRlbXBsb3llZV9yZXNpZGVuY2UpXG5cbmRhdGFzZXQkY29tcGFueV9sb2NhdGlvbiAgPC0gZmFjdG9yKGRhdGFzZXQkY29tcGFueV9sb2NhdGlvbilcblxuZGF0YXNldCRzYWxhcnlfY3VycmVuY3kgIDwtIGZhY3RvcihkYXRhc2V0JHNhbGFyeV9jdXJyZW5jeSlcblxuZGF0YXNldCRqb2JfdGl0bGUgIDwtIGZhY3RvcihkYXRhc2V0JGpvYl90aXRsZSlcblxuXG5kYXRhc2V0JGNvbXBhbnlfc2l6ZSA9IGZhY3RvcihkYXRhc2V0JGNvbXBhbnlfc2l6ZSwgbGV2ZWxzPWMoXFxTXFwsXFxNXFwsXFxMXFwpLCBsYWJlbHM9YygxLDIsMykpXG5cblxuZGF0YXNldCRqb2JfdGl0bGUgIDwtIGZhY3RvcihkYXRhc2V0JGpvYl90aXRsZSlcblxuYGBgXG5gYGAifQ== -->

```r
# Nominal attributes: plain factors
dataset$job_title <- factor(dataset$job_title)
dataset$employment_type <- factor(dataset$employment_type)
dataset$employee_residence <- factor(dataset$employee_residence)
dataset$company_location <- factor(dataset$company_location)
dataset$salary_currency <- factor(dataset$salary_currency)

# Ordinal attributes: levels listed from lowest to highest and labelled 1, 2, 3, ...
dataset$experience_level = factor(dataset$experience_level, levels=c("EN", "MI", "SE", "EX"), labels=c(1,2,3,4))
dataset$company_size = factor(dataset$company_size, levels=c("S","M","L"), labels=c(1,2,3))
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->
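
A plain factor stores the level order but does not treat the levels as ranked. If the ranking of an ordinal attribute such as experience_level should be usable in comparisons, `factor(..., ordered = TRUE)` is one option. The values below are made up, and whether an ordered factor suits the later modelling steps is an assumption that would need checking:

```r
# Made-up experience levels.
toy_levels <- c("SE", "EN", "MI", "EX", "MI")

# Ordered factor: EN < MI < SE < EX.
experience <- factor(toy_levels,
                     levels  = c("EN", "MI", "SE", "EX"),
                     ordered = TRUE)
print(experience)

# Order-aware comparison: which employees are senior level or above?
senior_or_above <- experience >= "SE"
print(senior_or_above)

# Integer codes follow the level order (EN = 1, ..., EX = 4).
codes <- as.integer(experience)
print(codes)
```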


<!-- rnb-text-begin -->


### Discretization of the salary_in_usd attribute

We compute the breaks from the quartiles and the 95th percentile of salary_in_usd, which yields five salary classes.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuYnJlYWtzIDwtIHF1YW50aWxlKGRhdGFzZXQkc2FsYXJ5X2luX3VzZCwgXG4gICAgICAgICAgICAgICAgICAgcHJvYnMgPSBjKDAsIC4yNSwgLjUsIC43NSwgLjk1LCAxKSwgXG4gICAgICAgICAgICAgICAgICAgbmEucm0gPSBUUlVFKVxuXG5cbmRhdGFzZXQkc2FsYXJ5X2luX3VzZCA8LSBjdXQoZGF0YXNldCRzYWxhcnlfaW5fdXNkLCBcbiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGJyZWFrcyA9IGJyZWFrcywgXG4gICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBpbmNsdWRlLmxvd2VzdCA9IFRSVUUsIFxuICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgbGFiZWxzPWMoXFxWZXJ5IExvd1xcLCBcXExvd1xcLCBcXE1lZGl1bVxcLCBcXEhpZ2hcXCwgXFxWZXJ5IEhpZ2hcXCkpXG5cblxuYGBgXG5gYGAifQ== -->

```r
# Cut points: minimum, quartiles, 95th percentile, and maximum
breaks <- quantile(dataset$salary_in_usd,
                   probs = c(0, .25, .5, .75, .95, 1),
                   na.rm = TRUE)

# Replace the numeric salary with a five-level factor
dataset$salary_in_usd <- cut(dataset$salary_in_usd,
                             breaks = breaks,
                             include.lowest = TRUE,
                             labels = c("Very Low", "Low", "Medium", "High", "Very High"))
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


### Normalization:

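We standardize the numeric attributes (remote_ratio and work_year) with `scale()`, which centers each attribute on its mean and divides by its standard deviation (z-score), so that both attributes carry equal weight. The resulting values are not bounded to a fixed interval such as [-1,1].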
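
A small sketch of what `scale()` does with its default arguments, using a hypothetical `toy` data frame rather than the real dataset: each column is centered on its mean and divided by its standard deviation.

```r
# Hypothetical values, for illustration only
toy <- data.frame(work_year    = c(2020, 2021, 2021, 2022, 2022, 2022),
                  remote_ratio = c(0, 50, 100, 100, 0, 50))

toy_scaled <- scale(toy)

round(colMeans(toy_scaled), 10)  # column means are 0 after centering
apply(toy_scaled, 2, sd)         # column standard deviations are 1
range(toy_scaled)                # the values are not limited to a fixed interval
```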


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZGF0YXNldCBbLCBjKFxcd29ya195ZWFyXFwgLCBcXHJlbW90ZV9yYXRpb1xcKV0gPSBzY2FsZShkYXRhc2V0IFssIGMoXFx3b3JrX3llYXJcXCAsIFxccmVtb3RlX3JhdGlvXFwpXSlcbmBgYFxuYGBgIn0= -->

```r
# Standardize the two numeric attributes (mean 0, standard deviation 1)
dataset[, c("work_year", "remote_ratio")] <- scale(dataset[, c("work_year", "remote_ratio")])
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


## Feature Selection

We will implement feature selection to remove redundant or irrelevant attributes from the data set. The goal is the smallest subset of attributes that still gives the most accurate predictions for our target class (salary_in_usd), which also decreases the time the classifier needs to process the data.

We will use RFE (Recursive Feature Elimination), which is a wrapper method for feature selection. Since the RFE function has multiple control options, we need to specify the ones we want. We choose "Random Forest" because it has high accuracy and can handle categorical data.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuY29udHJvbCA8LSByZmVDb250cm9sKGZ1bmN0aW9ucyA9IHJmRnVuY3MsIFxuICAgICAgICAgICAgICAgICAgICAgIG1ldGhvZCA9IFxccmVwZWF0ZWRjdlxcLFxuICAgICAgICAgICAgICAgICAgICAgIHJlcGVhdHMgPSA1LCBcbiAgICAgICAgICAgICAgICAgICAgICBudW1iZXIgPSAxMClcbmBgYFxuYGBgIn0= -->

```r
# Random-forest based RFE, evaluated with 10-fold cross-validation repeated 5 times
control <- rfeControl(functions = rfFuncs,
                      method = "repeatedcv",
                      repeats = 5,
                      number = 10)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


First we save the features to be used in the feature selection (every attribute except the class label "salary_in_usd") in variable x, and the class label in variable y. Then we split the data into 80% training and 20% test.
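
For a factor outcome, `createDataPartition()` samples within each class so that the class proportions are preserved. The base-R sketch below illustrates that idea on a hypothetical label vector `toy_y`. It is a simplified illustration, not caret's actual implementation.

```r
set.seed(2021)

# Hypothetical class labels, for illustration only
toy_y <- factor(rep(c("Low", "Medium", "High"), times = c(50, 30, 20)))

# Sample 80% of the row indices separately within each class
train_idx <- unlist(lapply(split(seq_along(toy_y), toy_y),
                           function(idx) sample(idx, size = round(0.8 * length(idx)))))

prop.table(table(toy_y))             # class proportions in the full data
prop.table(table(toy_y[train_idx]))  # the same proportions in the training part
```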


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxueCA8LSBkYXRhc2V0ICU+JVxuICBzZWxlY3QoLXNhbGFyeV9pbl91c2QpICU+JVxuICBhcy5kYXRhLmZyYW1lKClcblxuIyBUYXJnZXQgdmFyaWFibGVcbnkgPC0gZGF0YXNldCRzYWxhcnlfaW5fdXNkXG5cbiMgVHJhaW5pbmc6IDgwJTsgVGVzdDogMjAlXG5zZXQuc2VlZCgyMDIxKVxuaW5UcmFpbiA8LSBjcmVhdGVEYXRhUGFydGl0aW9uKHksIHAgPSAuODAsIGxpc3QgPSBGQUxTRSlbLDFdXG5cbnhfdHJhaW4gPC0geFsgaW5UcmFpbiwgXVxueF90ZXN0ICA8LSB4Wy1pblRyYWluLCBdXG5cbnlfdHJhaW4gPC0geVsgaW5UcmFpbl1cbnlfdGVzdCAgPC0geVstaW5UcmFpbl1cblxuYGBgXG5gYGAifQ== -->

```r
# Predictors: every attribute except the class label
x <- dataset %>%
  select(-salary_in_usd) %>%
  as.data.frame()

# Target variable
y <- dataset$salary_in_usd

# Training: 80%; Test: 20%
set.seed(2021)
inTrain <- createDataPartition(y, p = .80, list = FALSE)[, 1]

x_train <- x[ inTrain, ]
x_test  <- x[-inTrain, ]

y_train <- y[ inTrain]
y_test  <- y[-inTrain]
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


After splitting the data, we can perform the selection using rfe.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMSlcbnJlc3VsdF9yZmUxIDwtIHJmZSh4ID0geF90cmFpbiwgXG4gICAgICAgICAgICAgICAgICAgeSA9IHlfdHJhaW4sIFxuICAgICAgICAgICAgICAgICAgIHNpemVzID0gYygxOjkpLFxuICAgICAgICAgICAgICAgICAgIHJmZUNvbnRyb2wgPSBjb250cm9sKVxuXG5yZXN1bHRfcmZlMVxuXG5wcmVkaWN0b3JzKHJlc3VsdF9yZmUxKVxuXG5gYGBcbmBgYCJ9 -->

```r
set.seed(1)
# Evaluate subsets of 1 to 9 predictors
result_rfe1 <- rfe(x = x_train,
                   y = y_train,
                   sizes = 1:9,
                   rfeControl = control)

result_rfe1

# Names of the predictors in the selected subset
predictors(result_rfe1)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The results show that all the remaining attributes, except for "employment_type", are selected. This is logical, as 98% of the rows have the value "FT", as shown in the table below. Due to the low variance, we decided to remove this attribute.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxudGFibGUoZGF0YXNldCRlbXBsb3ltZW50X3R5cGUpXG5gYGBcbmBgYCJ9 -->

```r
table(dataset$employment_type)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->



<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZGF0YXNldDwtZGF0YXNldFssLXdoaWNoKG5hbWVzKGRhdGFzZXQpPT1cXGVtcGxveW1lbnRfdHlwZVxcKV1cbmBgYFxuYGBgIn0= -->

```r
# Drop the low-variance employment_type column
dataset <- dataset[, -which(names(dataset) == "employment_type")]
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


# phase 3

During this phase, our focus will be on clustering and classification techniques to analyze the data. The primary objectives are to identify distinct groups within the dataset through clustering, classify data objects into meaningful categories, and apply different evaluation methods to assess the accuracy and precision of both classification and clustering results. We aim to gain deeper insights into the data and discover patterns.

## Retrieve our preprocessed dataset


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG4jIFJlYWQgdGhlIENTViBmaWxlIGZyb20gZ2l0aHViXG5kYXRhc2V0Mj0gcmVhZC5jc3YodXJsKFxcaHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL1NhcmFoQWxoaW5kaS9ETV9wcm9qZWN0L21haW4vRGF0YSUyMFNldC9wcmVwcm9jZXNzZWREYXRhc2V0LmNzdlxcKSwgaGVhZGVyPVRSVUUpXG5cbiMgSWRlbnRpZnkgdGhlIGNoYXJhY3RlciB2YXJpYWJsZXMgaW4gdGhlIGRhdGFzZXQyXG5jaGFyX3ZhcnMgPC0gc2FwcGx5KGRhdGFzZXQyLCBpcy5jaGFyYWN0ZXIpXG5cbiMgQ29udmVydCB0aGUgaWRlbnRpZmllZCBjaGFyYWN0ZXIgdmFyaWFibGVzIGluIGRhdGFzZXQyIHRvIGZhY3RvcnNcbmRhdGFzZXQyW2NoYXJfdmFyc10gPC0gbGFwcGx5KGRhdGFzZXQyW2NoYXJfdmFyc10sIGFzLmZhY3RvcilcblxuYGBgXG5gYGAifQ== -->

```r
# Read the CSV file from GitHub
dataset2 <- read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/preprocessedDataset.csv"), header = TRUE)

# Identify the character variables in dataset2
char_vars <- sapply(dataset2, is.character)

# Convert the identified character variables in dataset2 to factors
dataset2[char_vars] <- lapply(dataset2[char_vars], as.factor)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


## balancing data

To resolve the problem of class imbalance in the dataset, we will use the SMOTE() method, which oversamples the minority class by creating synthetic samples from the existing minority class samples.
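
The core idea of SMOTE for numeric attributes can be sketched in a few lines of base R: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. The `minority` matrix below is hypothetical, and the sketch does not reproduce how the `SMOTE()` function handles nominal attributes or under-sampling.

```r
set.seed(10)

# Hypothetical minority-class points with two numeric features
minority <- matrix(c(1.0, 2.0,
                     1.5, 1.8,
                     2.0, 2.2), ncol = 2, byrow = TRUE)

# Euclidean distance from the first point to every minority point
i <- 1
d <- sqrt(rowSums(sweep(minority, 2, minority[i, ])^2))
d[i] <- Inf            # exclude the point itself
nn <- which.min(d)     # index of its nearest minority neighbour

# Interpolate between the point and its neighbour
u <- runif(1)
synthetic <- minority[i, ] + u * (minority[nn, ] - minority[i, ])
synthetic
```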


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMTApXG5iYWxhbmNlZF9kYXRhc2V0IDwtIFNNT1RFKHNhbGFyeV9pbl91c2QgfiAuLCBkYXRhc2V0MiwgcGVyYy5vdmVyID0gMzAwLCBwZXJjLnVuZGVyPTUwMCwgayA9IDEwKVxuYGBgXG5gYGAifQ== -->

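```r
set.seed(10)
# perc.over controls how many synthetic minority samples are created,
# perc.under how many majority samples are kept, k the number of neighbours used
balanced_dataset <- SMOTE(salary_in_usd ~ ., dataset2, perc.over = 300, perc.under = 500, k = 10)
```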

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->



## Data mining techniques


The goal of all preceding steps is to properly prepare the dataset for classification and clustering, which constitute our primary mining objectives. In this section, we will employ various attribute selection measures (the Gini index, gain ratio, and information gain) to construct decision tree models. We will thoroughly evaluate their performance, and a model that proves effective can subsequently be used to classify new instances with unknown class labels. The prediction process is as follows: divide the data into training and test sets, train the model on the training set, and test its performance on the test set.

Since our dataset is small, we decided to use k-fold cross-validation as the dataset partitioning method. For each attribute selection method we will try different numbers of folds (10, 5, and 3).
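
As a minimal sketch of how k-fold cross-validation partitions the rows (with hypothetical sizes `n` and `k`, independent of caret's own fold creation): every row is assigned to exactly one fold, and each fold serves once as the test set while the remaining folds form the training set.

```r
set.seed(10)

n <- 20  # hypothetical number of rows
k <- 5   # number of folds

# Assign each row to one of k folds of (nearly) equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (f in seq_len(k)) {
  test_rows  <- which(fold_id == f)
  train_rows <- which(fold_id != f)
  cat("fold", f, ": train =", length(train_rows), ", test =", length(test_rows), "\n")
}
```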

Throughout this section we will use the train and trainControl functions of the caret package to produce decision trees. For the Gini index the method is "rpart" from the "rpart" package, for the gain ratio it is "J48" from the "RWeka" package, and for information gain it is "C5.0" from the "C50" package.



Data clustering is a process that partitions data into groups, or clusters. It is an unsupervised learning process, which is executed without knowing the class labels of the training data. The data in the same cluster are similar to one another and different from the data in other clusters. For this data mining task we will use k-means clustering.
We will use the method "fviz_nbclust" of the package "factoextra" to find the number of clusters based on the elbow method and the silhouette coefficient. To run k-means clustering we will use the method "kmeans" of the package "stats". To visualize the clusters, we will use the method "fviz_cluster" from the package "factoextra". Finally, to find the average silhouette width of each cluster, the method "silhouette" from the package "cluster" will be used.
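
The following sketch shows how `kmeans()` and `silhouette()` fit together on a hypothetical two-cluster toy matrix. The data, the choice of two centers, and `nstart = 10` are assumptions made for illustration only.

```r
library(cluster)

set.seed(10)

# Hypothetical data: two well-separated groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# k-means with two clusters; nstart repeats the random initialization
km <- kmeans(toy, centers = 2, nstart = 10)

# Silhouette width of every point, based on pairwise Euclidean distances
sil <- silhouette(km$cluster, dist(toy))

summary(sil)$clus.avg.widths  # average silhouette width per cluster
mean(sil[, "sil_width"])      # average silhouette width overall
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below suggest overlapping clusters.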


## Evaluation and Comparison



### Classification



The following function will be used to compute the average sensitivity, specificity, and precision across classes, together with the overall accuracy:


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG5cbm1hY3JvID0gZnVuY3Rpb24obWF0cml4KXtcbiAgXG4gIHN1bVNlbj0wXG4gIFxuICBmb3IgKGkgaW4gMTo1KSB7XG4gICBzdW1TZW4gPSBzdW1TZW4gKyBtYXRyaXgkYnlDbGFzc1tpLDFdIFxuICB9XG4gIFxuICBcbiAgYXZnU2VuID0gc3VtU2VuLzVcbiAgXG4gIHN1bVNwZWM9MFxuICBcbiAgZm9yIChpIGluIDE6NSkge1xuICAgc3VtU3BlYyA9IHN1bVNwZWMgKyBtYXRyaXgkYnlDbGFzc1tpLDJdIFxuICB9XG4gIGF2Z1NwZWMgPSBzdW1TcGVjLzVcbiAgXG4gIFxuICBcbiAgXG4gIHN1bVByZWM9MFxuICBcbiAgZm9yIChpIGluIDE6NSkge1xuICAgc3VtUHJlYyA9IHN1bVByZWMgKyBtYXRyaXgkYnlDbGFzc1tpLDNdIFxuICB9XG4gIGF2Z1ByZWMgPSBzdW1QcmVjLzVcbiAgXG4gIFxuICBcbiAgXG4gIGF2Z3MgPSBkYXRhLmZyYW1lKFNlbnNpdGl2aXR5PWF2Z1NlbiAsIFNwZWNpZmljaXR5PWF2Z1NwZWMsIFByZWNpc2lvbj1hdmdQcmVjICxBY2N1cmFjeT0gdW5uYW1lKCBtYXRyaXgkb3ZlcmFsbFsxXSkgKVxuICBwcmludChhdmdzKVxuICBcbiAgXG59XG5cblxuYGBgXG5gYGAifQ== -->

```r
# Macro-average the per-class metrics of a caret confusion matrix.
# cm: object returned by caret::confusionMatrix() for a multi-class problem
macro <- function(cm) {
  by_class <- cm$byClass

  # Average each metric over all classes; "Pos Pred Value" is the precision
  avgs <- data.frame(
    Sensitivity = mean(by_class[, "Sensitivity"]),
    Specificity = mean(by_class[, "Specificity"]),
    Precision   = mean(by_class[, "Pos Pred Value"]),
    Accuracy    = unname(cm$overall["Accuracy"])
  )

  avgs
}
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


#### Gini index

The Gini index measures the impurity of the dataset. The partitioning that yields the most substantial reduction in impurity is selected as the splitting attribute. To apply the Gini index, we will employ the "rpart" method, which uses the Gini index as the splitting criterion.
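
To make the impurity reduction concrete, the sketch below computes the Gini index, one minus the sum of squared class proportions, before and after a binary split. The `toy` data frame and the helper `gini()` are hypothetical and only illustrate the calculation. They are not part of rpart.

```r
# Gini impurity of a vector of class labels
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Hypothetical records, for illustration only
toy <- data.frame(
  experience = c("EN", "EN", "MI", "MI", "SE", "SE", "EX", "EX"),
  salary     = c("Low", "Low", "Low", "High", "High", "High", "High", "High")
)

# Candidate binary split: {SE, EX} versus {EN, MI}
senior <- toy$experience %in% c("SE", "EX")

gini_before <- gini(toy$salary)
gini_after  <- mean(senior)  * gini(toy$salary[senior]) +
               mean(!senior) * gini(toy$salary[!senior])

gini_before - gini_after  # reduction in impurity achieved by the split
```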

##### 10 Folds

The tree of the gini index using 10 folds


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG5zZXQuc2VlZCgxMClcbmN0cmwgPC0gdHJhaW5Db250cm9sKG1ldGhvZCA9IFxcY3ZcXCwgbnVtYmVyID0gMTAsIHJldHVyblJlc2FtcD1cXGFsbFxcLCBzYXZlUHJlZGljdGlvbnM9XFxmaW5hbFxcKVxuXG50dW5lR3JpZCA8LSBleHBhbmQuZ3JpZChjcCA9IGMoMC4wMDEsIDAuMDA1LCAwLjAxKSlcblxuZ2luaUluZGV4MTAgPC0gdHJhaW4oXG4gIHNhbGFyeV9pbl91c2QgfiAuLFxuICBkYXRhID0gYmFsYW5jZWRfZGF0YXNldCxcbiAgbWV0aG9kID0gXFxycGFydFxcLFxuICB0ckNvbnRyb2wgPSBjdHJsLHR1bmVHcmlkPXR1bmVHcmlkLFxuICBjb250cm9sID0gbGlzdChcbiAgICBtaW5zcGxpdCA9IDEwLFxuICAgIG1pbmJ1Y2tldCA9IDUsXG4gICAgeHZhbCA9IDEwLFxuICAgIGNwID0gMC4wMDAxXG4gIClcblxuKVxuXG5cbnBycChnaW5pSW5kZXgxMCRmaW5hbE1vZGVsLCBib3gucGFsZXR0ZSA9IFxcUmVkc1xcLCB0d2VhayA9IDEuMiwgdmFybGVuID0gMjApXG5cbmBgYFxuYGBgIn0= -->

```r
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp = "all", savePredictions = "final")

# Candidate complexity parameters evaluated by caret
tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))

giniIndex10 <- train(
  salary_in_usd ~ .,
  data = balanced_dataset,
  method = "rpart",
  trControl = ctrl,
  tuneGrid = tuneGrid,
  control = list(
    minsplit = 10,
    minbucket = 5,
    xval = 10,
    cp = 0.0001
  )
)

prp(giniIndex10$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The "experience level" attribute was selected as the first splitting attribute, meaning that it yields the largest impurity reduction.

###### Confusion matrix of 10 folds using Gini Index

The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG5naW5pSW5kZXgxMGNtID0gY2FyZXQ6OmNvbmZ1c2lvbk1hdHJpeChnaW5pSW5kZXgxMCRwcmVkJG9icyxnaW5pSW5kZXgxMCRwcmVkJHByZWQpXG5cbmdpbmlJbmRleDEwY21cblxuYGBgXG5gYGAifQ== -->

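```r
# caret::confusionMatrix() expects the predictions as `data` and the observed labels as `reference`
giniIndex10cm <- caret::confusionMatrix(data = giniIndex10$pred$pred,
                                        reference = giniIndex10$pred$obs)

giniIndex10cm
```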

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The metrics shown for each class are computed by treating that class as the positive class and all other classes as the negative class. Here the classifier performed best when "very high" was the positive class, but this result alone says little, since all classes should be taken into consideration.
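
The one-vs-rest calculation can be sketched in base R as follows. The 3-class table `cm` and the helper `one_vs_rest()` are hypothetical and serve only to show how per-class sensitivity, specificity, and precision are derived and then averaged over classes.

```r
# Hypothetical confusion table: rows = predicted class, columns = actual class
cm <- matrix(c(8, 1, 1,
               2, 7, 1,
               0, 2, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("Low", "Medium", "High"),
                             actual    = c("Low", "Medium", "High")))

# Treat `cls` as the positive class and all other classes as negative
one_vs_rest <- function(cm, cls) {
  tp <- cm[cls, cls]
  fp <- sum(cm[cls, ]) - tp   # predicted as cls, actually another class
  fn <- sum(cm[, cls]) - tp   # actually cls, predicted as another class
  tn <- sum(cm) - tp - fp - fn
  c(Sensitivity = tp / (tp + fn),
    Specificity = tn / (tn + fp),
    Precision   = tp / (tp + fp))
}

per_class <- t(sapply(colnames(cm), one_vs_rest, cm = cm))
per_class            # one row of metrics per positive class
colMeans(per_class)  # macro average over the classes
```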

##### 5 Folds

The tree of the gini index using 5 folds


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMTApXG5jdHJsIDwtIHRyYWluQ29udHJvbChtZXRob2QgPSBcXGN2XFwsIG51bWJlciA9IDUsIHJldHVyblJlc2FtcD1cXGFsbFxcLCBzYXZlUHJlZGljdGlvbnM9XFxmaW5hbFxcKVxuXG5cbnR1bmVHcmlkIDwtIGV4cGFuZC5ncmlkKGNwID0gYygwLjAwMSwgMC4wMDUsIDAuMDEpKVxuXG5naW5pSW5kZXg1IDwtIHRyYWluKHNhbGFyeV9pbl91c2QgfiAuLCBkYXRhID0gYmFsYW5jZWRfZGF0YXNldCwgbWV0aG9kID0gXFxycGFydFxcLCB0ckNvbnRyb2wgPSBjdHJsLHR1bmVHcmlkPXR1bmVHcmlkLFxuICBjb250cm9sID0gbGlzdChcbiAgICBtaW5zcGxpdCA9IDEwLFxuICAgIG1pbmJ1Y2tldCA9IDUsXG4gICAgeHZhbCA9IDEwLFxuICAgIGNwID0gMC4wMDAxXG4gICkpXG5cbnBycChnaW5pSW5kZXg1JGZpbmFsTW9kZWwsIGJveC5wYWxldHRlID0gXFxSZWRzXFwsIHR3ZWFrID0gMS41LCB2YXJsZW4gPSAxMCwgY2V4ID0gMC4xNSlcblxuXG5gYGBcbmBgYCJ9 -->

```r
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp = "all", savePredictions = "final")

# Candidate complexity parameters evaluated by caret
tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))

giniIndex5 <- train(
  salary_in_usd ~ .,
  data = balanced_dataset,
  method = "rpart",
  trControl = ctrl,
  tuneGrid = tuneGrid,
  control = list(
    minsplit = 10,
    minbucket = 5,
    xval = 10,
    cp = 0.0001
  )
)

prp(giniIndex5$finalModel, box.palette = "Reds", tweak = 1.5, varlen = 10, cex = 0.15)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


This tree has the same structure as the previous tree that used 10 folds, so here as well "experience level" was chosen as the first splitting attribute.

###### Confusion matrix of 5 folds using Gini Index

The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZ2luaUluZGV4NWNtID0gY2FyZXQ6OmNvbmZ1c2lvbk1hdHJpeChnaW5pSW5kZXg1JHByZWQkb2JzLGdpbmlJbmRleDUkcHJlZCRwcmVkKVxuXG5naW5pSW5kZXg1Y21cblxuYGBgXG5gYGAifQ== -->

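```r
# caret::confusionMatrix() expects the predictions as `data` and the observed labels as `reference`
giniIndex5cm <- caret::confusionMatrix(data = giniIndex5$pred$pred,
                                       reference = giniIndex5$pred$obs)

giniIndex5cm
```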

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The results are very close to those of the 10-fold tree, so here as well the classifier performs best when "very high" is the positive class.

##### 3 Folds

The tree of the gini index using 3 folds


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMTApXG5jdHJsIDwtIHRyYWluQ29udHJvbChtZXRob2QgPSBcXGN2XFwsIG51bWJlciA9IDMsIHJldHVyblJlc2FtcD1cXGFsbFxcLCBzYXZlUHJlZGljdGlvbnM9XFxmaW5hbFxcKVxuXG5cbnR1bmVHcmlkIDwtIGV4cGFuZC5ncmlkKGNwID0gYygwLjAwMSwgMC4wMDUsIDAuMDEpKVxuXG5naW5pSW5kZXgzIDwtIHRyYWluKHNhbGFyeV9pbl91c2QgfiAuLCBkYXRhID0gYmFsYW5jZWRfZGF0YXNldCwgbWV0aG9kID0gXFxycGFydFxcLCB0ckNvbnRyb2wgPSBjdHJsLHR1bmVHcmlkPXR1bmVHcmlkLFxuICBjb250cm9sID0gbGlzdChcbiAgICBtaW5zcGxpdCA9IDEwLFxuICAgIG1pbmJ1Y2tldCA9IDUsXG4gICAgeHZhbCA9IDEwLFxuICAgIGNwID0gMC4wMDAxXG4gICkpXG5cbnBycChnaW5pSW5kZXgzJGZpbmFsTW9kZWwsIGJveC5wYWxldHRlID0gXFxSZWRzXFwsIHR3ZWFrID0gMS41LCB2YXJsZW4gPSAxMCwgY2V4ID0gMC4xNSlcblxuXG5gYGBcbmBgYCJ9 -->

```r
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp = "all", savePredictions = "final")

# Candidate complexity parameters evaluated by caret
tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))

giniIndex3 <- train(
  salary_in_usd ~ .,
  data = balanced_dataset,
  method = "rpart",
  trControl = ctrl,
  tuneGrid = tuneGrid,
  control = list(
    minsplit = 10,
    minbucket = 5,
    xval = 10,
    cp = 0.0001
  )
)

prp(giniIndex3$finalModel, box.palette = "Reds", tweak = 1.5, varlen = 10, cex = 0.15)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The tree has a similar structure to the two previous trees, both in its choice of splitting attributes and in its leaves.

###### Confusion matrix of 3 folds using Gini Index

The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG5naW5pSW5kZXgzY20gPSBjYXJldDo6Y29uZnVzaW9uTWF0cml4KGdpbmlJbmRleDMkcHJlZCRvYnMsZ2luaUluZGV4MyRwcmVkJHByZWQpXG5cbmdpbmlJbmRleDNjbVxuXG5gYGBcbmBgYCJ9 -->

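```r
# caret::confusionMatrix() expects the predictions as `data` and the observed labels as `reference`
giniIndex3cm <- caret::confusionMatrix(data = giniIndex3$pred$pred,
                                       reference = giniIndex3$pred$obs)

giniIndex3cm
```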

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Here as well the "very high" class has the best overall performance.

###### Analysis of the gini index classification

All three trees seem to be alike in their arrangement and form.

1.  Root Node - Experience Level: The initial attribute used for splitting the dataset at the root node is the "experience level." This divides the tree into two main branches or subtrees:
    -   Right Subtree: This comprises instances with Senior (SE) and Executive (EX) experience levels.
    -   Left Subtree: This includes individuals with Entry (EN) and Mid (MI) experience levels.
2.  Right Subtree - work year: The next attribute used to further classify the right subtree is "work year." The decision criterion is:
    -   If work year is \<-1.8: Then it is high.
    -   If work year is NOT \<-1.8: The next attribute examined is "experience level."
3.  Left Subtree - Experience Level: On the left side of the tree, the attribute "experience level" is used to further bifurcate the instances:
    -   If experience level is \>=2: The next attribute examined is "experience level."
    -   If experience level is NOT \>=2: The next attribute examined is also "experience level."

###### Sensitivity, Accuracy, Specificity and Precision of 3, 5 and 10 folds using Gini Index


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxucmJpbmQoXFwxMCBGb2xkc1xcPW1hY3JvKGdpbmlJbmRleDEwY20pLCBcXDUgRm9sZHNcXD1tYWNybyhnaW5pSW5kZXg1Y20pLCBcXDMgRm9sZHNcXD1tYWNybyhnaW5pSW5kZXgzY20pICApIFxuYGBgXG5gYGAifQ== -->

```r
# One row of macro-averaged metrics per number of folds
rbind("10 Folds" = macro(giniIndex10cm), "5 Folds" = macro(giniIndex5cm), "3 Folds" = macro(giniIndex3cm))
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The 10-fold case has the highest sensitivity, specificity, precision, and accuracy, which indicates better overall performance. So the Gini index model performs better with 10-fold cross-validation than with 5 or 3 folds.

#### Gain ratio

The gain ratio, a normalized measure of information gain, is calculated by dividing information gain by the split information. The attribute that yields the highest gain ratio is chosen as the splitting attribute. The C4.5 algorithm employs the gain ratio.
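
Written out in the usual textbook notation, which is assumed here since the notebook does not define its own symbols: $D$ is the dataset, $p_i$ the proportion of tuples in class $i$ among $m$ classes, and $D_1, \dots, D_v$ the partitions produced by splitting on attribute $A$.

$$
\begin{aligned}
Info(D) &= -\sum_{i=1}^{m} p_i \log_2 p_i \\
Info_A(D) &= \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, Info(D_j) \\
Gain(A) &= Info(D) - Info_A(D) \\
SplitInfo_A(D) &= -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} \\
GainRatio(A) &= \frac{Gain(A)}{SplitInfo_A(D)}
\end{aligned}
$$

Dividing by the split information penalizes attributes that split the data into many small partitions, which plain information gain tends to favor.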

The J48 is the Java-based open-source implementation of the C4.5 algorithm, and it is included in the Weka package. This implementation allows users to conveniently apply the C4.5 decision tree.

##### 10 Folds

The tree of the gain ratio using 10 folds


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMTApXG5jdHJsIDwtIHRyYWluQ29udHJvbChtZXRob2QgPSBcXGN2XFwsIG51bWJlciA9IDEwLCByZXR1cm5SZXNhbXA9XFxhbGxcXCwgc2F2ZVByZWRpY3Rpb25zPVxcZmluYWxcXClcbmdhaW5SYXRpbzEwIDwtIHRyYWluKHNhbGFyeV9pbl91c2QgfiAuLCBkYXRhID0gYmFsYW5jZWRfZGF0YXNldCwgbWV0aG9kID0gXFxKNDhcXCx0ckNvbnRyb2wgPSBjdHJsKVxucGxvdChnYWluUmF0aW8xMCRmaW5hbE1vZGVsKVxuYGBgXG5gYGAifQ== -->

```r
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp = "all", savePredictions = "final")
gainRatio10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48", trControl = ctrl)
plot(gainRatio10$finalModel)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The first splitting attribute chosen is the "experience level" attribute, meaning that it probably has a high information gain and a low SplitInfo (the entropy of the distribution of tuples into partitions).

###### Confusion matrix of 10 folds using Gain ratio

The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZ2FpblJhdGlvMTBjbSA9IGNhcmV0Ojpjb25mdXNpb25NYXRyaXgoZ2FpblJhdGlvMTAkcHJlZCRvYnMsIGdhaW5SYXRpbzEwJHByZWQkcHJlZClcblxuZ2FpblJhdGlvMTBjbVxuXG5cbmBgYFxuYGBgIn0= -->

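```r
# caret::confusionMatrix() expects the predictions as `data` and the observed labels as `reference`
gainRatio10cm <- caret::confusionMatrix(data = gainRatio10$pred$pred,
                                        reference = gainRatio10$pred$obs)

gainRatio10cm
```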

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Here the classifier shows better performance when treating the "very high" and "very low" classes as the positive class: "very high" is better in sensitivity, and "very low" is better in specificity and precision (Pos Pred Value).

##### 5 Folds

The tree of the gain ratio using 5 folds


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMTApXG5jdHJsIDwtIHRyYWluQ29udHJvbChtZXRob2QgPSBcXGN2XFwsIG51bWJlciA9IDUsIHJldHVyblJlc2FtcD1cXGFsbFxcLCBzYXZlUHJlZGljdGlvbnM9XFxmaW5hbFxcKVxuZ2FpblJhdGlvNSA8LSB0cmFpbihzYWxhcnlfaW5fdXNkIH4gLiwgZGF0YSA9IGJhbGFuY2VkX2RhdGFzZXQsIG1ldGhvZCA9IFxcSjQ4XFwsdHJDb250cm9sID0gY3RybClcbnBsb3QoZ2FpblJhdGlvNSRmaW5hbE1vZGVsKVxuYGBgXG5gYGAifQ== -->

```r
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp = "all", savePredictions = "final")
gainRatio5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48", trControl = ctrl)
plot(gainRatio5$finalModel)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The tree is similar to the tree that resulted from 10 folds. It has chosen "experience level" as the first splitting attribute and seems to show similar behavior.

###### Confusion matrix of 5 folds using Gain ratio

The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG5nYWluUmF0aW81Y209Y2FyZXQ6OmNvbmZ1c2lvbk1hdHJpeChnYWluUmF0aW81JHByZWQkb2JzLCBnYWluUmF0aW81JHByZWQkcHJlZClcblxuZ2FpblJhdGlvNWNtXG5cbmBgYFxuYGBgIn0= -->

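```r
# caret::confusionMatrix() expects the predictions as `data` and the observed labels as `reference`
gainRatio5cm <- caret::confusionMatrix(data = gainRatio5$pred$pred,
                                       reference = gainRatio5$pred$obs)

gainRatio5cm
```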

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Unlike the 10-fold case, here the classifier has the best overall performance when "very high" is the positive class.

##### 3 Folds

The tree of the gain ratio using 3 folds


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMTApXG5jdHJsIDwtIHRyYWluQ29udHJvbChtZXRob2QgPSBcXGN2XFwsIG51bWJlciA9IDMsIHJldHVyblJlc2FtcD1cXGFsbFxcLCBzYXZlUHJlZGljdGlvbnM9XFxmaW5hbFxcKVxuZ2FpblJhdGlvMyA8LSB0cmFpbihzYWxhcnlfaW5fdXNkIH4gLiwgZGF0YSA9IGJhbGFuY2VkX2RhdGFzZXQsIG1ldGhvZCA9IFxcSjQ4XFwsdHJDb250cm9sID0gY3RybClcbnBsb3QoZ2FpblJhdGlvMyRmaW5hbE1vZGVsKVxuYGBgXG5gYGAifQ== -->

```r
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp = "all", savePredictions = "final")
gainRatio3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48", trControl = ctrl)
plot(gainRatio3$finalModel)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The tree shows similar behavior to the previous two trees that resulted from using 10 and 5 folds.

###### Confusion matrix of 3 folds using Gain ratio

The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuZ2FpblJhdGlvM2NtPWNhcmV0Ojpjb25mdXNpb25NYXRyaXgoZ2FpblJhdGlvMyRwcmVkJG9icywgZ2FpblJhdGlvMyRwcmVkJHByZWQpXG5cbmdhaW5SYXRpbzNjbVxuXG5gYGBcbmBgYCJ9 -->

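```r
# caret::confusionMatrix() expects the predictions as `data` and the observed labels as `reference`
gainRatio3cm <- caret::confusionMatrix(data = gainRatio3$pred$pred,
                                       reference = gainRatio3$pred$obs)

gainRatio3cm
```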

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Similar to the 5-fold case, the "very high" class has the best metrics.

###### Analysis of the gain ratio classification

The observed structure of the three decision trees seems to be the same and it can be summarized as follows:

1.  Root Node - Experience Level: The initial attribute used for splitting the dataset at the root node is the "experience level." This divides the tree into two main branches or subtrees:
    -   Right Subtree: This comprises instances with Senior (SE) and Executive (EX) experience levels.
    -   Left Subtree: This includes individuals with Entry (EN) and Mid (MI) experience levels.
2.  Within the right subtree:
    -   If the 'experience level' is 4 (EX, Executive level) , the tree splits based on the 'Employee_residence' attribute. It checks whether the 'Employee_residence' is 'Latin America.'
    -   If 'Employee_residence' does not equal 'Latin America,' the differentiation continues with the 'remote_ratio' attribute, further dividing the tree.
3.  Within the left subtree:
    -   If the 'experience level' is 1 (EN, Entry level), the tree splits based on the 'Employee_residence' attribute, specifically checking for 'Sub-Saharan Africa.'
    -   If the 'experience level' is 2 (MI, Mid level), it also branches based on 'Employee_residence,' but in this case, looking to see if it equals 'North America.'

The decision tree continues to select the most appropriate attributes for splitting at each node, progressively refining the decision process until it reaches the leaves, where final class labels are assigned to the instances.

###### Sensitivity, Accuracy, Specificity and Precision of 3, 5 and 10 folds using Gain ratio


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxucmJpbmQoXFwxMCBGb2xkc1xcPW1hY3JvKGdhaW5SYXRpbzEwY20pLCBcXDUgRm9sZHNcXD1tYWNybyhnYWluUmF0aW81Y20pLCBcXDMgRm9sZHNcXD1tYWNybyhnYWluUmF0aW8zY20pICApIFxuYGBgXG5gYGAifQ== -->

```r
# One row of macro-averaged metrics per number of folds
rbind("10 Folds" = macro(gainRatio10cm), "5 Folds" = macro(gainRatio5cm), "3 Folds" = macro(gainRatio3cm))
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Based on the average sensitivity, precision, specificity, and accuracy, the gain ratio model built with 10-fold cross-validation performs better than the other two models. However, the difference in performance between the models is relatively small.

A detailed examination of the results from the 10-fold cross-validation reveals that the model has a notably high specificity compared to the other metrics. This high specificity suggests that the model is particularly effective at correctly identifying instances that do not belong to the target class. For example, if the positive class is "High", the model correctly recognizes that tuples belonging to "Very Low", "Low", "Medium", and "Very High" are not "High".

However, high specificity alone does not guarantee the overall effectiveness of the model, since a well-rounded model requires balanced performance across the other metrics as well. In this case, the model's ability to capture instances that do belong to the positive class (as measured by sensitivity) is not as robust. To be considered truly effective, a model needs strong specificity and sensitivity, so that it accurately distinguishes both negative and positive instances, as well as strong accuracy and precision.

#### Information gain

Information Gain is a metric used to decide which attribute to choose for splitting the data at each node in the decision tree. For a given dataset, the Information Gain of an attribute is calculated by comparing the entropy before and after the dataset is split based on that attribute. The attribute with the highest Information Gain is chosen as the splitting attribute.
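
The comparison of entropy before and after a split can be sketched as follows. The helpers `entropy()` and `info_gain()` and the toy vectors are hypothetical and only illustrate the calculation. They are not how the C50 package computes its splits internally.

```r
# Entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- prop.table(table(labels))
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the values of `attribute`
info_gain <- function(labels, attribute) {
  weights <- prop.table(table(attribute))
  after <- sum(sapply(names(weights), function(v) {
    weights[[v]] * entropy(labels[attribute == v])
  }))
  entropy(labels) - after
}

# Hypothetical records, for illustration only
salary     <- c("Low", "Low", "Low", "High", "High", "High", "High", "High")
experience <- c("EN", "EN", "MI", "MI", "SE", "SE", "EX", "EX")
remote     <- c("0", "100", "0", "100", "0", "100", "0", "100")

info_gain(salary, experience)  # larger gain: experience separates the classes well
info_gain(salary, remote)      # smaller gain: remote ratio says little about salary
```

In this toy example the attribute with the larger gain would be chosen as the splitting attribute.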

##### 10 Folds

The tree of the information gain using 10 folds


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMTApXG5jdHJsIDwtIHRyYWluQ29udHJvbChtZXRob2QgPSBcXGN2XFwsIG51bWJlciA9IDEwLCByZXR1cm5SZXNhbXA9XFxhbGxcXCwgc2F2ZVByZWRpY3Rpb25zPVxcZmluYWxcXClcblxuXG5pbmZvR2FpbjEwIDwtIHRyYWluKHNhbGFyeV9pbl91c2QgfiAuLCBkYXRhID0gYmFsYW5jZWRfZGF0YXNldCwgbWV0aG9kID0gXFxDNS4wXFwsdHJDb250cm9sID0gY3RybClcblxuYzVtb2RlbCA8LSBDNS4wKHNhbGFyeV9pbl91c2QgfiAuLFxuICAgICAgICAgICAgICAgICAgICAgICBkYXRhID0gYmFsYW5jZWRfZGF0YXNldCxcbiAgICAgICAgICAgICAgICAgICAgICAgdHJpYWxzID0gaW5mb0dhaW4xMCRiZXN0VHVuZSR0cmlhbHMsIFxuICAgICAgICAgICAgICAgICAgICAgICBydWxlcyA9IEZBTFNFLFxuICAgICAgICAgICAgICAgICAgICAgICBjb250cm9sID0gQzUuMENvbnRyb2wod2lubm93ID0gaW5mb0dhaW4xMCRiZXN0VHVuZSR3aW5ub3cpKVxucGxvdChjNW1vZGVsKVxuYGBgXG5gYGAifQ== -->

```r
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp = "all", savePredictions = "final")

infoGain10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0", trControl = ctrl)

# Refit a C5.0 tree with the tuned parameters so that it can be plotted
c5model <- C5.0(salary_in_usd ~ .,
                data = balanced_dataset,
                trials = infoGain10$bestTune$trials,
                rules = FALSE,
                control = C5.0Control(winnow = infoGain10$bestTune$winnow))
plot(c5model)
```

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


From the tree, the "experience level" attribute was selected as the first splitting attribute, meaning that it has the highest information gain among all attributes.

###### Confusion matrix of 10 folds using Information gain

The following confusion Matrix will show the performance of the classifier using the predicted class labels and the actual class labels of our dataset


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuaW5mb0dhaW4xMGNtPSBjYXJldDo6Y29uZnVzaW9uTWF0cml4KGluZm9HYWluMTAkcHJlZCRvYnMsIGluZm9HYWluMTAkcHJlZCRwcmVkKVxuXG5pbmZvR2FpbjEwY21cblxuYGBgXG5gYGAifQ== -->

```r
```r
infoGain10cm= caret::confusionMatrix(infoGain10$pred$obs, infoGain10$pred$pred)

infoGain10cm

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Similar to the trees built with the Gini index and gain ratio, the classifier seems to perform better when the "very high" class is treated as the positive class.

##### 5 Folds

The tree of the information gain using 5 folds


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMTApXG5jdHJsIDwtIHRyYWluQ29udHJvbChtZXRob2QgPSBcXGN2XFwsIG51bWJlciA9IDUsIHJldHVyblJlc2FtcD1cXGFsbFxcLCBzYXZlUHJlZGljdGlvbnM9XFxmaW5hbFxcKVxuXG5cbmluZm9HYWluNSA8LSB0cmFpbihzYWxhcnlfaW5fdXNkIH4gLiwgZGF0YSA9IGJhbGFuY2VkX2RhdGFzZXQsIG1ldGhvZCA9IFxcQzUuMFxcLHRyQ29udHJvbCA9IGN0cmwpXG5cbmM1bW9kZWwgPC0gQzUuMChzYWxhcnlfaW5fdXNkIH4gLixcbiAgICAgICAgICAgICAgICAgICAgICAgZGF0YSA9IGJhbGFuY2VkX2RhdGFzZXQsXG4gICAgICAgICAgICAgICAgICAgICAgIHRyaWFscyA9IGluZm9HYWluNSRiZXN0VHVuZSR0cmlhbHMsIFxuICAgICAgICAgICAgICAgICAgICAgICBydWxlcyA9IEZBTFNFLFxuICAgICAgICAgICAgICAgICAgICAgICBjb250cm9sID0gQzUuMENvbnRyb2wod2lubm93ID0gaW5mb0dhaW41JGJlc3RUdW5lJHdpbm5vdykpXG5wbG90KGM1bW9kZWwpXG5gYGBcbmBgYCJ9 -->

```r
```r
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp = "all", savePredictions = "final")


infoGain5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0", trControl = ctrl)

c5model <- C5.0(salary_in_usd ~ .,
                       data = balanced_dataset,
                       trials = infoGain5$bestTune$trials, 
                       rules = FALSE,
                       control = C5.0Control(winnow = infoGain5$bestTune$winnow))
plot(c5model)

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The tree behaves similarly to the 10-fold information gain tree.

###### Confusion matrix of 5 folds using Information gain

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuaW5mb0dhaW41Y20gPSBjYXJldDo6Y29uZnVzaW9uTWF0cml4KGluZm9HYWluNSRwcmVkJG9icywgaW5mb0dhaW41JHByZWQkcHJlZClcblxuaW5mb0dhaW41Y21cblxuYGBgXG5gYGAifQ== -->

```r
```r
infoGain5cm = caret::confusionMatrix(infoGain5$pred$obs, infoGain5$pred$pred)

infoGain5cm

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The classifier performs very similarly to the 10-fold information gain model.

##### 3 Folds

The tree of the information gain using 3 folds


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuc2V0LnNlZWQoMTApXG5jdHJsIDwtIHRyYWluQ29udHJvbChtZXRob2QgPSBcXGN2XFwsIG51bWJlciA9IDMsIHJldHVyblJlc2FtcD1cXGFsbFxcLCBzYXZlUHJlZGljdGlvbnM9XFxmaW5hbFxcKVxuXG5cbmluZm9HYWluMyA8LSB0cmFpbihzYWxhcnlfaW5fdXNkIH4gLiwgZGF0YSA9IGJhbGFuY2VkX2RhdGFzZXQsIG1ldGhvZCA9IFxcQzUuMFxcLHRyQ29udHJvbCA9IGN0cmwpXG5cbmM1bW9kZWwgPC0gQzUuMChzYWxhcnlfaW5fdXNkIH4gLixcbiAgICAgICAgICAgICAgICAgICAgICAgZGF0YSA9IGJhbGFuY2VkX2RhdGFzZXQsXG4gICAgICAgICAgICAgICAgICAgICAgIHRyaWFscyA9IGluZm9HYWluMyRiZXN0VHVuZSR0cmlhbHMsIFxuICAgICAgICAgICAgICAgICAgICAgICBydWxlcyA9IEZBTFNFLFxuICAgICAgICAgICAgICAgICAgICAgICBjb250cm9sID0gQzUuMENvbnRyb2wod2lubm93ID0gaW5mb0dhaW4zJGJlc3RUdW5lJHdpbm5vdykpXG5wbG90KGM1bW9kZWwpXG5gYGBcbmBgYCJ9 -->

```r
```r
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp = "all", savePredictions = "final")


infoGain3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0", trControl = ctrl)

c5model <- C5.0(salary_in_usd ~ .,
                       data = balanced_dataset,
                       trials = infoGain3$bestTune$trials, 
                       rules = FALSE,
                       control = C5.0Control(winnow = infoGain3$bestTune$winnow))
plot(c5model)

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


The visible parts of the tree seem to behave the same as the trees from the previous two fold sizes, 10 and 5.

###### Confusion matrix of 3 folds using Information gain

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuaW5mb0dhaW4zY20gPSBjYXJldDo6Y29uZnVzaW9uTWF0cml4KGluZm9HYWluMyRwcmVkJG9icywgaW5mb0dhaW4zJHByZWQkcHJlZClcblxuaW5mb0dhaW4zY21cblxuYGBgXG5gYGAifQ== -->

```r
```r
infoGain3cm = caret::confusionMatrix(infoGain3$pred$obs, infoGain3$pred$pred)

infoGain3cm

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Since the tree is essentially the same as the previous two information gain trees, its results are also very close to theirs in performance.

###### Analysis of the information gain classification

The observed structure of the three decision trees seems to be the same and it can be summarized as follows:

1.  Root Node - Experience Level: The initial attribute used for splitting the dataset at the root node is the "experience level." This divides the tree into two main branches or subtrees:

    -   Right Subtree: This comprises instances with Senior (SE) and Executive (EX) experience levels.
    -   Left Subtree: This includes individuals with Entry (EN) and Mid (MI) experience levels.

2.  Within the right subtree: if the experience level is 4 (EX), the tree is split on "company location".

3.  Within the left subtree: both experience levels, 1 (EN) and 2 (MI), are split on "employee residence". When "employee residence" is "North America", the tree is further split on "salary currency", and when that attribute equals "USD", the split is on "job title".

The decision tree continues to select the most appropriate attributes for splitting at each node, progressively refining the decision process until it reaches the leaves, where final class labels are assigned to the instances.

###### Sensitivity, accuracy, specificity and precision of the 3, 5 and 10 folds using Information gain


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxucmJpbmQoXFwxMCBGb2xkc1xcPW1hY3JvKGluZm9HYWluMTBjbSksIFxcNSBGb2xkc1xcPW1hY3JvKGluZm9HYWluNWNtKSwgXFwzIEZvbGRzXFw9bWFjcm8oaW5mb0dhaW4zY20pICApIFxuYGBgXG5gYGAifQ== -->

```r
```r
rbind("10 Folds" = macro(infoGain10cm), "5 Folds" = macro(infoGain5cm), "3 Folds" = macro(infoGain3cm))

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


Based on the sensitivity, specificity, precision, and accuracy values, no fold size is clearly superior for the information gain model. As the table shows, 10 folds gives the best specificity and precision, while 5 folds gives the best sensitivity and accuracy. Additional factors or further analysis may be needed to make a well-informed choice.
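For readers who want to see how such macro-averaged values can be derived, here is a minimal sketch that computes them from a multi-class confusion table. The helper `macro_sketch` and the toy labels are assumptions for illustration only; the `macro()` function used in this notebook may be implemented differently.

```r
# Hypothetical helper: macro-averaged metrics from a square confusion table
# whose rows are predicted classes and whose columns are actual classes.
macro_sketch <- function(cm) {
  tp <- diag(cm)
  fp <- rowSums(cm) - tp   # predicted as the class, actually another class
  fn <- colSums(cm) - tp   # actually the class, predicted as another class
  tn <- sum(cm) - tp - fp - fn
  c(
    Sensitivity = mean(tp / (tp + fn)),
    Specificity = mean(tn / (tn + fp)),
    Precision   = mean(tp / (tp + fp)),
    Accuracy    = sum(tp) / sum(cm)
  )
}

# Toy labels, invented for illustration only
lvls <- c("low", "mid", "high")
actual_toy    <- factor(c("low", "low", "mid", "mid", "high", "high"), levels = lvls)
predicted_toy <- factor(c("low", "mid", "mid", "mid", "high", "low"), levels = lvls)

cm_toy <- table(Predicted = predicted_toy, Actual = actual_toy)
macro_toy <- macro_sketch(cm_toy)
print(round(macro_toy, 3))
```

Each per-class metric treats one class as positive and all others as negative; the macro average is the unweighted mean over the classes, so small classes count as much as large ones.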

### Clustering

Data clustering is the process of partitioning data into groups, or clusters. It is an unsupervised learning process, which is executed without knowing the class labels of the training data. The data in the same cluster are similar to one another and different from the data in other clusters. For this data mining task, we will use k-means clustering.

#### 1- Preprocessing

We will encode the remaining factor columns as numeric types before clustering, which enables distance calculations for k-means and the other measures used later. We will also remove the class label from the dataset, since clustering is an unsupervised learning process, and keep it in a separate variable for later use. Lastly, we will scale all numeric attributes in the dataset so that they are standardized.
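The effect of encoding and scaling can be seen on a small example. This is a sketch under assumptions: the toy data frame is invented, and it encodes a factor through its level order, whereas the notebook assigns explicit numeric labels.

```r
# Toy data frame, invented for illustration
toy_prep <- data.frame(
  company_size = factor(c("S", "M", "L", "M"), levels = c("S", "M", "L")),
  remote_ratio = c(0, 50, 100, 50)
)

# Encode the categories as the integers 1, 2, 3 following the level order.
# Note: integer codes impose an order and a distance on the categories,
# which is only natural for ordinal attributes such as company size.
toy_prep$company_size <- as.numeric(toy_prep$company_size)

# Standardize every column: subtract the column mean, divide by the column sd
toy_scaled <- scale(toy_prep)

print(toy_scaled)
print(colMeans(toy_scaled))      # approximately 0 for every column
print(apply(toy_scaled, 2, sd))  # 1 for every column
```

After scaling, both columns contribute on the same scale to the Euclidean distances that k-means relies on, so an attribute measured in large units no longer dominates.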


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG4jIHZpZXcgZGF0YVxuXG5kYXRhc2V0MyA8LSBkYXRhc2V0MlxuVmlldyhkYXRhc2V0MylcblxuIyBSZXNlcnZlIHRoZSBzYWxhcnlfaW5fdXNkICh0aGUgY2xhc3MgbGFiZWwpIGNvbHVtbiBpbiBhbiBhdHRyaWJ1dGUgYmVmb3JlIHJlbW92aW5nIGl0IGZyb20gdGhlIGRhdGFzZXQgZm9yIGNsdXN0ZXJpbmdcblxuY2xhc3NMYWJlbCA8LSBkYXRhc2V0M1ssIDVdIFxuXG5cbiMgUmVtb3ZlIHRoZSBjbGFzcyBsYWJsZSBmcm9tIHRoZSBkYXRhc2V0XG5cbmRhdGFzZXQzIDwtIGRhdGFzZXQzWywgLTVdXG5cbiMgZW5jb2Rpbmcgam9iX3RpdGxlIHZhcmlhYmxlXG5cbmRhdGFzZXQzJGpvYl90aXRsZSA9IGZhY3RvcihkYXRhc2V0MyRqb2JfdGl0bGUsIGxldmVscz1jKFxcQW5hbHlzdFxcLCBcXEFyY2hpdGVjdFxcLCBcXEVuZ2luZWVyXFwsIFxcTGVhZGVyc2hpcFxcLCBcXENvbnN1bHRhbnQvU3BlY2lhbGlzdFxcLFxcQ3liZXIgU2VjdXJpdHlcXCxcXE90aGVyc1xcICksIGxhYmVscz1jKDQsMSwyLDUsMyw2LDcpKVxuXG4jIGVuY29kaW5nIHNhbGFyeV9jdXJyZW5jeSB2YXJpYWJsZVxuXG5kYXRhc2V0MyRzYWxhcnlfY3VycmVuY3kgPSBmYWN0b3IoZGF0YXNldDMkc2FsYXJ5X2N1cnJlbmN5LCBsZXZlbHM9YyhcXFVTRFxcLFxcQlJMXFwsXFxHQlBcXCxcXEVVUlxcLFxcSU5SXFwsXFxDQURcXCxcXENIRlxcLFxcREtLXFwsXFxTR0RcXCxcXEFVRFxcLFxcU0VLXFwsXFxNWE5cXCxcXElMU1xcLFxcUExOXFwsXFxOT0tcXCxcXElEUlxcLFxcTlpEXFwsXFxIVUZcXCxcXFpBUlxcLFxcVFdEXFwsXFxSVUJcXCksIGxhYmVscz1jKDEsMiwzLDQsNSw2LDcsOCw5LDEwLDExLDEyLDEzLDE0LDE1LDE2LDE3LDE4LDE5LDIwLDIxKSlcblxuIyBlbmNvZGluZyBlbXBsb3llZV9yZXNpZGVuY2UgdmFyaWFibGVcblxuZGF0YXNldDMkZW1wbG95ZWVfcmVzaWRlbmNlID0gZmFjdG9yKGRhdGFzZXQzJGVtcGxveWVlX3Jlc2lkZW5jZSwgbGV2ZWxzPWMoXFxOb3J0aCBBbWVyaWNhXFwsXFxMYXRpbiBBbWVyaWNhICYgQ2FyaWJiZWFuXFwsXFxTdWItU2FoYXJhbiBBZnJpY2FcXCwgXFxFdXJvcGUgJiBDZW50cmFsIEFzaWFcXCxcXEVhc3QgQXNpYSAmIFBhY2lmaWNcXCxcXFNvdXRoIEFzaWFcXCxcXE1pZGRsZSBFYXN0ICYgTm9ydGggQWZyaWNhXFwpLCBsYWJlbHM9YygxLDIsMyw0LDUsNiw3KSlcblxuIyBlbmNvZGluZyBjb21wYW55X2xvY2F0aW9uIHZhcmlhYmxlXG5cbmRhdGFzZXQzJGNvbXBhbnlfbG9jYXRpb24gPSBmYWN0b3IoZGF0YXNldDMkY29tcGFueV9sb2NhdGlvbiwgbGV2ZWxzPWMoXFxOb3J0aCBBbWVyaWNhXFwsXFxMYXRpbiBBbWVyaWNhICYgQ2FyaWJiZWFuXFwsXFxTdWItU2FoYXJhbiBBZnJpY2FcXCwgXFxFdXJvcGUgJiBDZW50cmFsIEFzaWFcXCxcXEVhc3QgQXNpYSAmIFBhY2lmaWNcXCxcXFNvdXRoIEFzaWFcXCxcXE1pZGRsZSBFYXN0ICYgTm9ydGggQWZyaWNhXF
wsIFxcQVFcXCwgXFxVTVxcKSwgbGFiZWxzPWMoMSwyLDMsNCw1LDYsNyw4LDkpKVxuXG5cbiBcbiNEYXRhIHR5cGVzIHRvIGJlIHRyYW5zZm9ybWVkIGludG8gbnVtZXJpYyB0eXBlcyBiZWZvcmUgY2x1c3RlcmluZ1xuI1RyYW5zZm9ybWluZyBhbGwgbm9uLW51bWVyaWMgYXR0cmlidXRlcyB0byBudW1lcmljIHR5cGVcblxuXG5kYXRhc2V0MyRleHBlcmllbmNlX2xldmVsIDwtIGFzLm51bWVyaWMoYXMuY2hhcmFjdGVyKGRhdGFzZXQzJGV4cGVyaWVuY2VfbGV2ZWwpKVxuXG5kYXRhc2V0MyRqb2JfdGl0bGUgPC0gYXMubnVtZXJpYyhhcy5jaGFyYWN0ZXIoZGF0YXNldDMkam9iX3RpdGxlKSlcblxuZGF0YXNldDMkc2FsYXJ5X2N1cnJlbmN5IDwtIGFzLm51bWVyaWMoYXMuY2hhcmFjdGVyKGRhdGFzZXQzJHNhbGFyeV9jdXJyZW5jeSkpXG5cbmRhdGFzZXQzJGVtcGxveWVlX3Jlc2lkZW5jZSA8LSBhcy5udW1lcmljKGFzLmNoYXJhY3RlcihkYXRhc2V0MyRlbXBsb3llZV9yZXNpZGVuY2UpKVxuXG5kYXRhc2V0MyRjb21wYW55X2xvY2F0aW9uIDwtIGFzLm51bWVyaWMoYXMuY2hhcmFjdGVyKGRhdGFzZXQzJGNvbXBhbnlfbG9jYXRpb24pKVxuXG5kYXRhc2V0MyRjb21wYW55X3NpemUgPC0gYXMubnVtZXJpYyhhcy5jaGFyYWN0ZXIoZGF0YXNldDMkY29tcGFueV9zaXplKSlcblxuIyB2aXdlIHRoZSBjbGFzcyBvZiBhdHRyaWJ1dGVzIHRvIGVuc3VyZSB0aGV5IGhhdmUgdHJhbnNmb3JtZWQgdG8gbnVtZXJpY1xuc2FwcGx5KGRhdGFzZXQzLCBjbGFzcylcblxuXG4jc2NhbGUgYWxsIGF0dHJpYnV0ZXMgaW4gdGhlIGRhdGFzZXQgc28gdGhleSB3b3VsZCBiZSBzdGFuZGFyZGl6ZWQgXG5kYXRhc2V0MyA8LSBzY2FsZShkYXRhc2V0MylcblxuYGBgXG5gYGAifQ== -->

```r
```r

# copy the data and view it

dataset3 <- dataset2
View(dataset3)

# Keep the salary_in_usd column (the class label) in a separate variable before removing it from the dataset for clustering

classLabel <- dataset3[, 5]


# Remove the class label from the dataset

dataset3 <- dataset3[, -5]

# encoding job_title variable

dataset3$job_title = factor(dataset3$job_title, levels=c("Analyst", "Architect", "Engineer", "Leadership", "Consultant/Specialist", "Cyber Security", "Others"), labels=c(4,1,2,5,3,6,7))

# encoding salary_currency variable

dataset3$salary_currency = factor(dataset3$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21))

# encoding employee_residence variable

dataset3$employee_residence = factor(dataset3$employee_residence, levels=c("North America","Latin America & Caribbean","Sub-Saharan Africa", "Europe & Central Asia","East Asia & Pacific","South Asia","Middle East & North Africa"), labels=c(1,2,3,4,5,6,7))

# encoding company_location variable

dataset3$company_location = factor(dataset3$company_location, levels=c("North America","Latin America & Caribbean","Sub-Saharan Africa", "Europe & Central Asia","East Asia & Pacific","South Asia","Middle East & North Africa", "AQ", "UM"), labels=c(1,2,3,4,5,6,7,8,9))



# Data types to be transformed into numeric types before clustering
# Transforming all non-numeric attributes to numeric type


dataset3$experience_level <- as.numeric(as.character(dataset3$experience_level))

dataset3$job_title <- as.numeric(as.character(dataset3$job_title))

dataset3$salary_currency <- as.numeric(as.character(dataset3$salary_currency))

dataset3$employee_residence <- as.numeric(as.character(dataset3$employee_residence))

dataset3$company_location <- as.numeric(as.character(dataset3$company_location))

dataset3$company_size <- as.numeric(as.character(dataset3$company_size))

# view the class of each attribute to ensure it has been transformed to numeric
sapply(dataset3, class)


# scale all attributes in the dataset so they are standardized
dataset3 <- scale(dataset3)

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


#### 2- K-means

After preprocessing the data, we are ready to perform the clustering. We will use k-means, a clustering method that aims to minimize the sum of squared distances between each data point and the centroid of its assigned cluster by iteratively updating the cluster assignments and the centroids.
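The objective described above can be checked by hand on synthetic data. The following sketch, which assumes two invented Gaussian groups and an arbitrary seed, recomputes the sum of squared distances to the assigned centroids and compares it with the `tot.withinss` value reported by `kmeans`.

```r
set.seed(1)  # arbitrary seed for the illustration

# Two synthetic, well-separated groups in two dimensions
toy_points <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2)
)

toy_fit <- kmeans(toy_points, centers = 2, nstart = 10)

# Recompute the objective: squared distance of each point
# to the centroid of the cluster it was assigned to
assigned_centroids <- toy_fit$centers[toy_fit$cluster, , drop = FALSE]
manual_wss <- sum((toy_points - assigned_centroids)^2)

print(c(manual = manual_wss, kmeans = toy_fit$tot.withinss))
```

The two numbers agree, which shows that `tot.withinss` is exactly the quantity k-means tries to minimize.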

#### 3- Choosing number of clusters (k)

We will choose three different values of k for k-means clustering: one relatively large, one in the middle, and one small. This way we cover a range of possible clustering results.

##### a- Silhouette method

Now we will apply the silhouette method to find the optimal number of clusters k. We also plot a graph in which the x-axis represents the number of clusters and the y-axis represents the average silhouette coefficient.
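As a rough illustration of what the silhouette method computes, the sketch below evaluates the average silhouette width for several values of k on synthetic data and picks the k with the largest value. The data, the seed, and the range 2 to 6 are assumptions chosen for the example.

```r
library(cluster)  # provides silhouette()

set.seed(1)  # arbitrary seed for the illustration

# Two synthetic, well-separated groups in two dimensions
sil_points <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2)
)
sil_dist <- dist(sil_points)

# Average silhouette width for each candidate number of clusters
candidate_k <- 2:6
avg_sil_width <- sapply(candidate_k, function(k) {
  fit <- kmeans(sil_points, centers = k, nstart = 10)
  sil <- silhouette(fit$cluster, sil_dist)
  mean(sil[, "sil_width"])
})
names(avg_sil_width) <- candidate_k

print(round(avg_sil_width, 3))
best_k <- as.integer(names(which.max(avg_sil_width)))
print(best_k)
```

Because the synthetic data contain two clearly separated groups, the average silhouette width peaks at k = 2.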


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG5mdml6X25iY2x1c3QoZGF0YXNldDMsIGttZWFucywgbWV0aG9kID0gXFxzaWxob3VldHRlXFwpK1xuICBsYWJzKHN1YnRpdGxlID0gXFxTaWxob3VldHRlIG1ldGhvZFxcKVxuXG5gYGBcbmBgYCJ9 -->

```r
```r

fviz_nbclust(dataset3, kmeans, method = "silhouette") +
  labs(subtitle = "Silhouette method")

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


As the graph shows, the number of clusters k that maximizes the average silhouette coefficient is 2, so we will use it for clustering.

##### b- Elbow method

This method determines the number of clusters from the turning point of a curve. The curve plots the total within-cluster sum of squares (WSS) on the y-axis against the number of clusters on the x-axis.
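The values behind such an elbow curve can be produced directly with `kmeans`. This is a minimal sketch on invented data with three groups; the seed and the range of k are assumptions for the example.

```r
set.seed(1)  # arbitrary seed for the illustration

# Three synthetic groups in two dimensions
elbow_points <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 6
elbow_k <- 1:6
elbow_wss <- sapply(elbow_k, function(k) {
  kmeans(elbow_points, centers = k, nstart = 10)$tot.withinss
})

print(data.frame(k = elbow_k, wss = round(elbow_wss, 1)))
```

Plotting `elbow_wss` against `elbow_k` gives the elbow curve: the WSS drops sharply up to the true number of groups (three here) and flattens afterwards, and the bend marks the suggested k.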


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG5mdml6X25iY2x1c3QoZGF0YXNldDMsIGttZWFucywgbWV0aG9kID0gXFx3c3NcXCkgK1xuICBnZW9tX3ZsaW5lKHhpbnRlcmNlcHQgPSA0LCBsaW5ldHlwZSA9IDIpK1xuICBsYWJzKHN1YnRpdGxlID0gXFxFbGJvdyBtZXRob2RcXClcblxuYGBgXG5gYGAifQ== -->

```r
```r

fviz_nbclust(dataset3, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2) +
  labs(subtitle = "Elbow method")

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


As shown, the number of clusters k that represents the turning point in the curve is 4, so we will use it for clustering.

Lastly, we will use k=3 since it achieves the second-highest average silhouette coefficient. Because it lies between 2 and 4, it strikes a balance between having too few clusters (k=2) and having several clusters (k=4), so this choice should give acceptable accuracy.

#### k-means clustering, visualization and evaluation

In this section, we will perform k-means clustering and visualize its result using the three values of k chosen beforehand. For each k, we will then compute the WSS, the BCubed precision and recall, and the average silhouette width as measures of clustering quality.
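BCubed precision and recall are less familiar than the other two measures, so a small worked example may help. The function `bcubed_sketch` below is a hypothetical vectorized variant written for illustration, and the six toy objects are invented; the notebook's own implementation uses an explicit loop.

```r
# Hypothetical helper: BCubed precision and recall from cluster assignments
# and ground-truth labels, computed with table lookups instead of a loop.
bcubed_sketch <- function(cluster, label) {
  cl  <- as.character(cluster)
  lab <- as.character(label)
  tab <- table(cl, lab)

  both       <- tab[cbind(cl, lab)]   # objects sharing cluster AND label
  in_cluster <- rowSums(tab)[cl]      # objects sharing the cluster
  in_label   <- colSums(tab)[lab]     # objects sharing the label

  c(precision = mean(both / in_cluster),
    recall    = mean(both / in_label))
}

# Six toy objects: cluster 1 holds a, a, b and cluster 2 holds b, b, b
toy_cluster <- c(1, 1, 1, 2, 2, 2)
toy_label   <- c("a", "a", "b", "b", "b", "b")

toy_bcubed <- bcubed_sketch(toy_cluster, toy_label)
print(round(toy_bcubed, 4))
```

For each object, precision is the share of its cluster that has the same label and recall is the share of its label that falls in the same cluster; both are then averaged over all objects. In the toy example this gives a precision of 7/9 and a recall of 0.75.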

##### k=2


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG4jVXNlIHNlZWQgdG8gZ3VhcmFudGVlIHJlcGxpY2FiaWxpdHkgb2YgcmFuZG9tIHByb2Nlc3Nlc1xuc2V0LnNlZWQoODk1MylcblxuIyBydW4gay1tZWFucyBjbHVzdGVyaW5nIHRvIGZpbmQgMiBjbHVzdGVyc1xua21lYW5zLnJlc3VsdCA8LSBrbWVhbnMoZGF0YXNldDMsIDIpXG5cbiMgcHJpbnQgdGhlIGNsdXN0ZXJuZyByZXN1bHRcbmttZWFucy5yZXN1bHRcblxuIyB2aXN1YWxpemUgY2x1c3RlcmluZ1xuZnZpel9jbHVzdGVyKGttZWFucy5yZXN1bHQsIGRhdGEgPSBkYXRhc2V0MylcblxuXG4jYXZlcmFnZSBzaWxob3VldHRlIGZvciBlYWNoIGNsdXN0ZXJzXG5hdmdfc2lsIDwtIHNpbGhvdWV0dGUoa21lYW5zLnJlc3VsdCRjbHVzdGVyLGRpc3QoZGF0YXNldDMpKSBcbmZ2aXpfc2lsaG91ZXR0ZShhdmdfc2lsKVxuXG4jV2l0aGluLWNsdXN0ZXIgc3VtIG9mIHNxdWFyZXMgd3NzIFxud3NzIDwtIGttZWFucy5yZXN1bHQkdG90LndpdGhpbnNzXG5wcmludCh3c3MpXG5cbiNCQ3ViZWRcbmttZWFuc19jbHVzdGVyIDwtIGMoa21lYW5zLnJlc3VsdCRjbHVzdGVyKVxuXG5ncm91bmRfdHJ1dGggPC0gYyhjbGFzc0xhYmVsKVxuXG5kYXRhIDwtIGRhdGEuZnJhbWUoY2x1c3RlciA9IGttZWFuc19jbHVzdGVyLCBsYWJlbCA9IGdyb3VuZF90cnV0aClcblxuXG4gIGJjdWJlZCA8LSBmdW5jdGlvbihkYXRhKSB7XG4gIG4gPC0gbnJvdyhkYXRhKVxuICB0b3RhbF9wcmVjZXNpb24gPC0gMFxuICB0b3RhbF9yZWNhbGwgPC0gMFxuXG5mb3IgKGkgaW4gMTpuKSB7XG4gIGNsdXN0ZXIgPC0gZGF0YSRjbHVzdGVyW2ldXG4gIGxhYmVsIDwtIGRhdGEkbGFiZWxbaV1cbiAgICBcbiMgTnVtYmVyIG9mIG9iamVjdHMgaW4gdGhlIHNhbWUgY2F0ZWdvcnkgYW5kIGNsdXN0ZXJcbmludGVyc2VjdGlvbiA8LSBzdW0oZGF0YSRsYWJlbFtkYXRhJGNsdXN0ZXIgPT0gY2x1c3Rlcl0gPT0gbGFiZWwpXG4gICAgXG4jIE51bWJlciBvZiBvYmplY3RzIHRoYXQgYXJlIGluIHRoZSBzYW1lIGNsdXN0ZXJcbnRvdGFsX3NhbWVfY2x1c3RlciA8LSBzdW0oZGF0YSRjbHVzdGVyID09IGNsdXN0ZXIpXG4gICAgXG4jIE51bWJlciBvZiBvYmplY3RzIHRoYXQgaGF2ZSB0aGUgc2FtZSBjYXRlZ29yeVxudG90YWxfc2FtZV9jYXRlZ29yeSA8LSBzdW0oZGF0YSRsYWJlbCA9PSBsYWJlbClcbiAgICBcblxudG90YWxfcHJlY2VzaW9uIDwtIHRvdGFsX3ByZWNlc2lvbiArIGludGVyc2VjdGlvbiAvdG90YWxfc2FtZV9jbHVzdGVyXG50b3RhbF9yZWNhbGwgPC0gdG90YWxfcmVjYWxsICsgaW50ZXJzZWN0aW9uIC8gdG90YWxfc2FtZV9jYXRlZ29yeVxuICB9XG5cbiAgIyBjb21wdXRlIGF2ZyBwcmVjaXNpb24gYW5kIHJlY2FsbFxuICBwcmVjaXNpb24gPC0gdG90YWxfcHJlY2VzaW9uIC8gblxuICByZWNhbGwgPC0gdG90YWxfcmVjYWxsIC8gblxuXG4gIHJldHVybihsaXN0KHByZWNpc2lvbiA9IHByZWNpc2lvbi
wgcmVjYWxsID0gcmVjYWxsKSkgfVxuXG5cbiMgY29tcHV0ZSBCQ3ViZWQgcHJlY2lzaW9uIGFuZCByZWNhbGxcbm1ldHJpY3MgPC0gYmN1YmVkKGRhdGEpXG5cblxucHJlY2lzaW9uIDwtIG1ldHJpY3MkcHJlY2lzaW9uXG5yZWNhbGwgPC0gbWV0cmljcyRyZWNhbGxcblxuIyBQcmludCByZXN1bHRzXG5jYXQoXFxCQ3ViZWQgUHJlY2lzaW9uIGlzOlxcLCBwcmVjaXNpb24sIFxcXFxuXFwpXG5jYXQoXFxCQ3ViZWQgUmVjYWxsIGlzOlxcLCByZWNhbGwsIFxcXFxuXFwpXG5gYGBcbmBgYCJ9 -->

```r
```r

# Use a seed so the random initialization is reproducible
set.seed(8953)

# run k-means clustering to find 2 clusters
kmeans.result <- kmeans(dataset3, 2)

# print the clustering result
kmeans.result

# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)


# silhouette width of every object, summarized per cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(dataset3))
fviz_silhouette(avg_sil)

# total within-cluster sum of squares (WSS)
wss <- kmeans.result$tot.withinss
print(wss)

# BCubed
kmeans_cluster <- c(kmeans.result$cluster)

ground_truth <- c(classLabel)

data <- data.frame(cluster = kmeans_cluster, label = ground_truth)


bcubed <- function(data) {
  n <- nrow(data)
  total_precision <- 0
  total_recall <- 0

  for (i in 1:n) {
    cluster <- data$cluster[i]
    label <- data$label[i]

    # Number of objects in the same category and cluster
    intersection <- sum(data$label[data$cluster == cluster] == label)

    # Number of objects that are in the same cluster
    total_same_cluster <- sum(data$cluster == cluster)

    # Number of objects that have the same category
    total_same_category <- sum(data$label == label)

    # Add this object's precision and recall to the running sums
    total_precision <- total_precision + intersection / total_same_cluster
    total_recall <- total_recall + intersection / total_same_category
  }

  # compute avg precision and recall
  precision <- total_precision / n
  recall <- total_recall / n

  return(list(precision = precision, recall = recall))
}


# compute BCubed precision and recall
metrics <- bcubed(data)


precision <- metrics$precision
recall <- metrics$recall

# Print results
cat("BCubed Precision is:", precision, "\n")
cat("BCubed Recall is:", recall, "\n")

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


We can conclude from the graph and the results that k=2 is the optimal k: there is no overlap between the two clusters, and the data in a cluster are close (similar) to each other and dissimilar to the data in the other cluster. The recall is relatively high (0.71) and is the highest among the chosen values of k. The precision is low (0.28), which could be due to the presence of outliers or to sensitivity to the initial centroids. The WSS is 7287.657, which indicates acceptable compactness of the clusters, keeping in mind that the WSS decreases as k increases. Lastly, the average silhouette width is 0.34, the highest among the chosen values of k, reflecting higher intra-cluster similarity.

##### k=3


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuXG4jVXNlIHNlZWQgdG8gZ3VhcmFudGVlIHJlcGxpY2FiaWxpdHkgb2YgcmFuZG9tIHByb2Nlc3Nlc1xuc2V0LnNlZWQoODk1MylcblxuIyBydW4gay1tZWFucyBjbHVzdGVyaW5nIHRvIGZpbmQgMyBjbHVzdGVyc1xua21lYW5zLnJlc3VsdCA8LSBrbWVhbnMoZGF0YXNldDMsIDMpXG5cbiMgcHJpbnQgdGhlIGNsdXN0ZXJuZyByZXN1bHRcbmttZWFucy5yZXN1bHRcblxuIyB2aXN1YWxpemUgY2x1c3RlcmluZ1xuZnZpel9jbHVzdGVyKGttZWFucy5yZXN1bHQsIGRhdGEgPSBkYXRhc2V0MylcblxuI2F2ZXJhZ2Ugc2lsaG91ZXR0ZSBmb3IgZWFjaCBjbHVzdGVyc1xuYXZnX3NpbCA8LSBzaWxob3VldHRlKGttZWFucy5yZXN1bHQkY2x1c3RlcixkaXN0KGRhdGFzZXQzKSkgXG5mdml6X3NpbGhvdWV0dGUoYXZnX3NpbClcblxuI1dpdGhpbi1jbHVzdGVyIHN1bSBvZiBzcXVhcmVzIHdzcyBcbndzcyA8LSBrbWVhbnMucmVzdWx0JHRvdC53aXRoaW5zc1xucHJpbnQod3NzKVxuXG4jQkN1YmVkIFxua21lYW5zX2NsdXN0ZXIgPC0gYyhrbWVhbnMucmVzdWx0JGNsdXN0ZXIpXG5cbmdyb3VuZF90cnV0aCA8LSBjKGNsYXNzTGFiZWwpXG5cbmRhdGEgPC0gZGF0YS5mcmFtZShjbHVzdGVyID0ga21lYW5zX2NsdXN0ZXIsIGxhYmVsID0gZ3JvdW5kX3RydXRoKVxuXG4gIGJjdWJlZCA8LSBmdW5jdGlvbihkYXRhKSB7XG4gIG4gPC0gbnJvdyhkYXRhKVxuICB0b3RhbF9wcmVjZXNpb24gPC0gMFxuICB0b3RhbF9yZWNhbGwgPC0gMFxuXG4gIGZvciAoaSBpbiAxOm4pIHtcbiAgICBjbHVzdGVyIDwtIGRhdGEkY2x1c3RlcltpXVxuICAgIGxhYmVsIDwtIGRhdGEkbGFiZWxbaV1cbiAgICBcbiMgTnVtYmVyIG9mIG9iamVjdHMgaW4gdGhlIHNhbWUgY2F0ZWdvcnkgYW5kIGNsdXN0ZXJcbmludGVyc2VjdGlvbiA8LSBzdW0oZGF0YSRsYWJlbFtkYXRhJGNsdXN0ZXIgPT0gY2x1c3Rlcl0gPT0gbGFiZWwpXG4gICAgXG4jIE51bWJlciBvZiBvYmplY3RzIHRoYXQgYXJlIGluIHRoZSBzYW1lIGNsdXN0ZXJcbnRvdGFsX3NhbWVfY2x1c3RlciA8LSBzdW0oZGF0YSRjbHVzdGVyID09IGNsdXN0ZXIpXG4gICAgXG4jIE51bWJlciBvZiBvYmplY3RzIHRoYXQgaGF2ZSB0aGUgc2FtZSBjYXRlZ29yeVxudG90YWxfc2FtZV9jYXRlZ29yeSA8LSBzdW0oZGF0YSRsYWJlbCA9PSBsYWJlbClcbiAgICBcblxudG90YWxfcHJlY2VzaW9uIDwtIHRvdGFsX3ByZWNlc2lvbiArIGludGVyc2VjdGlvbiAvdG90YWxfc2FtZV9jbHVzdGVyXG50b3RhbF9yZWNhbGwgPC0gdG90YWxfcmVjYWxsICsgaW50ZXJzZWN0aW9uIC8gdG90YWxfc2FtZV9jYXRlZ29yeVxuICB9XG5cbiAgIyBjb21wdXRlIGF2ZyBwcmVjaXNpb24gYW5kIHJlY2FsbFxuICBwcmVjaXNpb24gPC0gdG90YWxfcHJlY2VzaW9uIC8gblxuICByZWNhbGwgPC0gdG90YWxfcmVjYWxsIC8gblxuXG4gIHJldHVybihsaXN0KHByZWNpc2lvbiA9IHByZWNpc2
lvbiwgcmVjYWxsID0gcmVjYWxsKSlcbn1cblxuIyBjb21wdXRlIEJDdWJlZCBwcmVjaXNpb24gYW5kIHJlY2FsbFxubWV0cmljcyA8LSBiY3ViZWQoZGF0YSlcblxuXG5wcmVjaXNpb24gPC0gbWV0cmljcyRwcmVjaXNpb25cbnJlY2FsbCA8LSBtZXRyaWNzJHJlY2FsbFxuXG4jIFByaW50IHJlc3VsdHNcbmNhdChcXEJDdWJlZCBQcmVjaXNpb24gaXM6XFwsIHByZWNpc2lvbiwgXFxcXG5cXClcbmNhdChcXEJDdWJlZCBSZWNhbGwgaXM6XFwsIHJlY2FsbCwgXFxcXG5cXClcbmBgYFxuYGBgIn0= -->

```r
```r

# Use a seed so the random initialization is reproducible
set.seed(8953)

# run k-means clustering to find 3 clusters
kmeans.result <- kmeans(dataset3, 3)

# print the clustering result
kmeans.result

# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)

# silhouette width of every object, summarized per cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(dataset3))
fviz_silhouette(avg_sil)

# total within-cluster sum of squares (WSS)
wss <- kmeans.result$tot.withinss
print(wss)

# BCubed
kmeans_cluster <- c(kmeans.result$cluster)

ground_truth <- c(classLabel)

data <- data.frame(cluster = kmeans_cluster, label = ground_truth)

bcubed <- function(data) {
  n <- nrow(data)
  total_precision <- 0
  total_recall <- 0

  for (i in 1:n) {
    cluster <- data$cluster[i]
    label <- data$label[i]

    # Number of objects in the same category and cluster
    intersection <- sum(data$label[data$cluster == cluster] == label)

    # Number of objects that are in the same cluster
    total_same_cluster <- sum(data$cluster == cluster)

    # Number of objects that have the same category
    total_same_category <- sum(data$label == label)

    # Add this object's precision and recall to the running sums
    total_precision <- total_precision + intersection / total_same_cluster
    total_recall <- total_recall + intersection / total_same_category
  }

  # compute avg precision and recall
  precision <- total_precision / n
  recall <- total_recall / n

  return(list(precision = precision, recall = recall))
}

# compute BCubed precision and recall
metrics <- bcubed(data)


precision <- metrics$precision
recall <- metrics$recall

# Print results
cat("BCubed Precision is:", precision, "\n")
cat("BCubed Recall is:", recall, "\n")

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


We can conclude from the graph and the results for k=3 that the performance is good but worse than for k=2, because there is overlap between the clusters. The recall is relatively low (0.45), and the precision is also low (0.31), which could be due to the presence of outliers or to sensitivity to the initial centroids. The WSS is 6451.51, indicating intermediate compactness of the clusters, meaning that objects in a cluster are to some extent similar to one another. Lastly, the average silhouette width is 0.19, which reflects high inter-cluster similarity.

##### k=4


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuYGBgclxuI1VzZSBzZWVkIHRvIGd1YXJhbnRlZSByZXBsaWNhYmlsaXR5IG9mIHJhbmRvbSBwcm9jZXNzZXNcbnNldC5zZWVkKDg5NTMpXG5cbiMgcnVuIGstbWVhbnMgY2x1c3RlcmluZyB0byBmaW5kIDQgY2x1c3RlcnNcbmttZWFucy5yZXN1bHQgPC0ga21lYW5zKGRhdGFzZXQzLCA0KVxuXG4jIHByaW50IHRoZSBjbHVzdGVybmcgcmVzdWx0XG5rbWVhbnMucmVzdWx0XG5cbiMgdmlzdWFsaXplIGNsdXN0ZXJpbmdcbmZ2aXpfY2x1c3RlcihrbWVhbnMucmVzdWx0LCBkYXRhID0gZGF0YXNldDMpXG5cbiNhdmVyYWdlIHNpbGhvdWV0dGUgZm9yIGVhY2ggY2x1c3RlcnNcbmF2Z19zaWwgPC0gc2lsaG91ZXR0ZShrbWVhbnMucmVzdWx0JGNsdXN0ZXIsZGlzdChkYXRhc2V0MykpIFxuZnZpel9zaWxob3VldHRlKGF2Z19zaWwpXG5cbiNXaXRoaW4tY2x1c3RlciBzdW0gb2Ygc3F1YXJlcyB3c3MgXG53c3MgPC0ga21lYW5zLnJlc3VsdCR0b3Qud2l0aGluc3NcbnByaW50KHdzcylcblxuI0JDdWJlZFxua21lYW5zX2NsdXN0ZXIgPC0gYyhrbWVhbnMucmVzdWx0JGNsdXN0ZXIpXG5cbmdyb3VuZF90cnV0aCA8LSBjKGNsYXNzTGFiZWwpXG5cbmRhdGEgPC0gZGF0YS5mcmFtZShjbHVzdGVyID0ga21lYW5zX2NsdXN0ZXIsIGxhYmVsID0gZ3JvdW5kX3RydXRoKVxuXG5cbiAgYmN1YmVkIDwtIGZ1bmN0aW9uKGRhdGEpIHtcbiAgbiA8LSBucm93KGRhdGEpXG4gIHRvdGFsX3ByZWNlc2lvbiA8LSAwXG4gIHRvdGFsX3JlY2FsbCA8LSAwXG5cbiAgZm9yIChpIGluIDE6bikge1xuICAgIGNsdXN0ZXIgPC0gZGF0YSRjbHVzdGVyW2ldXG4gICAgbGFiZWwgPC0gZGF0YSRsYWJlbFtpXVxuICAgIFxuIyBOdW1iZXIgb2Ygb2JqZWN0cyBpbiB0aGUgc2FtZSBjYXRlZ29yeSBhbmQgY2x1c3RlclxuaW50ZXJzZWN0aW9uIDwtIHN1bShkYXRhJGxhYmVsW2RhdGEkY2x1c3RlciA9PSBjbHVzdGVyXSA9PSBsYWJlbClcbiAgICBcbiMgTnVtYmVyIG9mIG9iamVjdHMgaW4gdGhlIHNhbWUgIGNsdXN0ZXJcbnRvdGFsX3NhbWVfY2x1c3RlciA8LSBzdW0oZGF0YSRjbHVzdGVyID09IGNsdXN0ZXIpXG4gICAgXG4jIE51bWJlciBvZiBvYmplY3RzIHRoYXQgaGF2ZSB0aGUgc2FtZSBjYXRlZ29yeVxudG90YWxfc2FtZV9jYXRlZ29yeSA8LSBzdW0oZGF0YSRsYWJlbCA9PSBsYWJlbClcbiAgICBcbiMgQ2FsY3VsYXRlIHByZWNpc2lvbiBhbmQgcmVjYWxsIGZvciB0aGUgY3VycmVudCBpdGVtIGFuZCBhZGQgdGhlbSB0byB0aGUgc3Vtc1xudG90YWxfcHJlY2VzaW9uIDwtIHRvdGFsX3ByZWNlc2lvbiArIGludGVyc2VjdGlvbiAvdG90YWxfc2FtZV9jbHVzdGVyXG50b3RhbF9yZWNhbGwgPC0gdG90YWxfcmVjYWxsICsgaW50ZXJzZWN0aW9uIC8gdG90YWxfc2FtZV9jYXRlZ29yeVxuICB9XG5cbiAgIyBDb21wdXRlIGF2ZyBwcmVjaXNpb24gYW5kIHJlY2FsbFxuICBwcmVjaXNpb24gPC0gdG90YWxfcHJlY2VzaW9uIC
8gblxuICByZWNhbGwgPC0gdG90YWxfcmVjYWxsIC8gblxuXG4gIHJldHVybihsaXN0KHByZWNpc2lvbiA9IHByZWNpc2lvbiwgcmVjYWxsID0gcmVjYWxsKSlcbn1cblxuIyBjb21wdXRlIEJDdWJlZCBwcmVjaXNpb24gYW5kIHJlY2FsbFxubWV0cmljcyA8LSBiY3ViZWQoZGF0YSlcblxuXG5wcmVjaXNpb24gPC0gbWV0cmljcyRwcmVjaXNpb25cbnJlY2FsbCA8LSBtZXRyaWNzJHJlY2FsbFxuXG4jIFByaW50IHJlc3VsdHNcbmNhdChcXEJDdWJlZCBQcmVjaXNpb24gaXM6XFwsIHByZWNpc2lvbiwgXFxcXG5cXClcbmNhdChcXEJDdWJlZCBSZWNhbGwgaXM6XFwsIHJlY2FsbCwgXFxcXG5cXClcbmBgYFxuYGBgIn0= -->

```r
```r
# Use a seed so the random initialization is reproducible
set.seed(8953)

# run k-means clustering to find 4 clusters
kmeans.result <- kmeans(dataset3, 4)

# print the clustering result
kmeans.result

# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)

# silhouette width of every object, summarized per cluster
avg_sil <- silhouette(kmeans.result$cluster, dist(dataset3))
fviz_silhouette(avg_sil)

# total within-cluster sum of squares (WSS)
wss <- kmeans.result$tot.withinss
print(wss)

# BCubed
kmeans_cluster <- c(kmeans.result$cluster)

ground_truth <- c(classLabel)

data <- data.frame(cluster = kmeans_cluster, label = ground_truth)


bcubed <- function(data) {
  n <- nrow(data)
  total_precision <- 0
  total_recall <- 0

  for (i in 1:n) {
    cluster <- data$cluster[i]
    label <- data$label[i]

    # Number of objects in the same category and cluster
    intersection <- sum(data$label[data$cluster == cluster] == label)

    # Number of objects in the same cluster
    total_same_cluster <- sum(data$cluster == cluster)

    # Number of objects that have the same category
    total_same_category <- sum(data$label == label)

    # Calculate precision and recall for the current item and add them to the sums
    total_precision <- total_precision + intersection / total_same_cluster
    total_recall <- total_recall + intersection / total_same_category
  }

  # Compute avg precision and recall
  precision <- total_precision / n
  recall <- total_recall / n

  return(list(precision = precision, recall = recall))
}

# compute BCubed precision and recall
metrics <- bcubed(data)


precision <- metrics$precision
recall <- metrics$recall

# Print results
cat("BCubed Precision is:", precision, "\n")
cat("BCubed Recall is:", recall, "\n")

<!-- rnb-source-end -->

<!-- rnb-chunk-end -->


<!-- rnb-text-begin -->


We can conclude from the graph and the results for k=4 that the performance is worse than for k=2 and k=3, because there is noticeable overlap between the clusters. The clusters are also spread widely, which results in large distances between objects in the same cluster. In addition, the recall is relatively low (0.43), which might be a result of the overlap and the large distances between data objects. The precision is also low (0.29), which could be due to the presence of outliers or to sensitivity to the initial centroids. The WSS is 5911.05, indicating lower compactness of the clusters. Lastly, the average silhouette width is 0.22, which is low and reflects high inter-cluster similarity.




## Clustering results

This table displays the clustering results for each k under the evaluation measures used above.

+----------------------------------------+-------------------+----------------+-----------------------------+
|                                        | K=2               | K=3            | K=4                         |
+:======================================:+:=================:+:==============:+:===========================:+
| **Average Silhouette width**           | 0.34              | 0.31           | 0.2                         |
+----------------------------------------+-------------------+----------------+-----------------------------+
| **total within-cluster sum of square** | 7295.548          | 6778.046       | 5915.724                    |
+----------------------------------------+-------------------+----------------+-----------------------------+
| **BCubed precision**                   | 0.2812713         | 0.2807635      | 0.2895803                   |
+----------------------------------------+-------------------+----------------+-----------------------------+
| **BCubed recall**                      | 0.7064208         | 0.6786707      | 0.401187                    |
+----------------------------------------+-------------------+----------------+-----------------------------+
| **Visualization**                      | ![](images/k2-01) | ![](images/k3) | ![](images/k4){width="145"} |
+----------------------------------------+-------------------+----------------+-----------------------------+
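For reference, the total within-cluster sum of squares in the table is the quantity that base R's `kmeans()` returns as `tot.withinss`. The following is a minimal sketch on synthetic data; the `toy` matrix and its two blobs are assumptions for illustration, not the project's dataset.

```r
# Toy data: two well-separated 2-D blobs (synthetic, for illustration only)
set.seed(42)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))

# Total within-cluster sum of squares for K = 2, 3, 4
ks  <- 2:4
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})
names(wss) <- paste0("K=", ks)
print(round(wss, 2))
```

Because the WSS tends to decrease whenever K increases, it is usually read together with the silhouette width rather than on its own.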



## Phase 4




## Findings

While working on this project, we prepared the data so that data mining techniques, i.e. classification and clustering, could be applied to it, as discussed in the previous sections.

The following table displays the results of the three attribute selection measures (Gini index, information gain, and gain ratio), each evaluated with three different fold counts (k = 3, 5, 10).


<!-- rnb-text-end -->


<!-- rnb-chunk-begin -->


<!-- rnb-source-begin eyJkYXRhIjoiYGBgclxuXG4gXG5cbnByaW50KCBcbiAgcmJpbmQoXCIxMCBGb2xkc1wiID0gYyhcIiBcIiwgXCIgXCIsIFwiXCIsIFwiXCIpLCBcbiAgICAgICAgXCJnaW5pIGluZGV4XCIgPSBtYWNybyhnaW5pSW5kZXgxMGNtKSwgXG4gICAgICAgIFwiR2FpbiByYXRpb1wiID0gbWFjcm8oZ2FpblJhdGlvMTBjbSksIFxuICAgICAgICBcIkluZm9ybWF0aW9uIGdhaW5cIiA9IG1hY3JvKGluZm9HYWluMTBjbSksIFxuICAgICAgICBcIjUgRm9sZHNcIiA9IGMoXCIgXCIsIFwiIFwiLCBcIlwiLCBcIlwiKSwgXG4gICAgICAgIFwiZ2luaSBpbmRleFwiID0gbWFjcm8oZ2luaUluZGV4NWNtKSwgXG4gICAgICAgIFwiR2FpbiByYXRpb1wiID0gbWFjcm8oZ2FpblJhdGlvNWNtKSwgXG4gICAgICAgIFwiSW5mb3JtYXRpb24gZ2FpblwiID0gbWFjcm8oaW5mb0dhaW41Y20pLCBcbiAgICAgICAgXCIzIEZvbGRzXCIgPSBjKFwiIFwiLCBcIiBcIiwgXCJcIiwgXCJcIiksIFxuICAgICAgICBcImdpbmkgaW5kZXhcIiA9IG1hY3JvKGdpbmlJbmRleDNjbSksIFxuICAgICAgICBcIkdhaW4gcmF0aW9cIiA9IG1hY3JvKGdhaW5SYXRpbzNjbSksIFxuICAgICAgICBcIkluZm9ybWF0aW9uIGdhaW5cIiA9IG1hY3JvKGluZm9HYWluM2NtKSlcblxuKVxuYGBgIn0= -->

```r

 

print( 
  rbind("10 Folds" = c(" ", " ", "", ""), 
        "gini index" = macro(giniIndex10cm), 
        "Gain ratio" = macro(gainRatio10cm), 
        "Information gain" = macro(infoGain10cm), 
        "5 Folds" = c(" ", " ", "", ""), 
        "gini index" = macro(giniIndex5cm), 
        "Gain ratio" = macro(gainRatio5cm), 
        "Information gain" = macro(infoGain5cm), 
        "3 Folds" = c(" ", " ", "", ""), 
        "gini index" = macro(giniIndex3cm), 
        "Gain ratio" = macro(gainRatio3cm), 
        "Information gain" = macro(infoGain3cm))

)
```

Based on these metrics, we can further assess the performance of each method as follows. Accuracy: the accuracy of the gain ratio model with 10 folds is 63%, which is higher than the others, meaning that the model correctly classified 63% of the instances. It is followed by the gain ratio model with 5 folds, with 62% accuracy. The worst was the 3-fold Gini index model, with only 56% accuracy.

If we look at each fold size separately, we notice that the gain ratio has the best performance according to all four metrics (accuracy, precision, sensitivity, and specificity), so for 10, 5, and 3 folds the gain ratio is the best. It can also be noticed that the 3-fold Gini index model was the worst in all aspects, although the differences from the other models are small for most of the metrics.
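To make the four metrics concrete, here is a minimal sketch of how macro-averaged values can be derived from a multi-class confusion matrix. The helper `macro_metrics()` and the 3-class matrix are hypothetical and are not the project's own `macro()` function; the layout (rows = predicted, columns = actual) is assumed to match the table stored in a caret `confusionMatrix` object.

```r
# Hypothetical 3-class confusion matrix: rows = predicted, columns = actual
cm <- matrix(c(50,  5,  5,
               10, 40, 10,
                0, 15, 45),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("Low", "Medium", "High"),
                             actual    = c("Low", "Medium", "High")))

macro_metrics <- function(cm) {
  tp <- diag(cm)
  fp <- rowSums(cm) - tp   # predicted as the class, actually another class
  fn <- colSums(cm) - tp   # actually the class, predicted as another class
  tn <- sum(cm) - tp - fp - fn
  c(accuracy    = sum(tp) / sum(cm),
    precision   = mean(tp / (tp + fp)),   # per-class values, then averaged
    sensitivity = mean(tp / (tp + fn)),
    specificity = mean(tn / (tn + fp)))
}

round(macro_metrics(cm), 3)
```

Macro averaging gives every class the same weight regardless of its size, which is why it is a common choice when comparing models over several salary categories.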

So if we were to choose one among them, it would be the 10-fold gain ratio model. The gain ratio evaluated with 10-fold cross-validation appears to have the best performance among all the decision tree models. This might be because the gain ratio tends to favor unbalanced splits, where one partition is significantly smaller than the others. In our dataset, if an attribute has a rare value, the gain ratio may prioritize splitting on this attribute, despite the resulting imbalance.
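The preference for unbalanced splits can be illustrated with a small worked example. The sketch below computes information gain, split information, and gain ratio (information gain divided by split information) on hypothetical toy vectors; the function names and data are assumptions for illustration only.

```r
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

gain_ratio <- function(attribute, labels) {
  parts   <- split(labels, attribute)          # labels grouped by attribute value
  weights <- lengths(parts) / length(labels)   # share of rows in each partition
  info_gain  <- entropy(labels) - sum(weights * sapply(parts, entropy))
  split_info <- -sum(weights * log2(weights))
  c(info_gain = info_gain, split_info = split_info,
    gain_ratio = info_gain / split_info)
}

# Toy data (hypothetical): 8 employees with a binary salary label
salary <- c("High", "High", "High", "High", "Low", "Low", "Low", "Low")
level  <- c("SE", "SE", "SE", "SE", "EN", "EN", "EN", "EN")  # balanced split
rare   <- c("A", "B", "B", "B", "B", "B", "B", "B")          # one rare value

round(gain_ratio(level, salary), 3)
round(gain_ratio(rare, salary), 3)
```

For the `rare` attribute the split information is small, so dividing by it raises the score above the raw information gain, which is the mechanism behind the tendency described above.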

Despite the superiority of the 10-fold cross-validation, the performance metrics for all three fold sizes (3-fold, 5-fold, and 10-fold) using the Gini index, gain ratio, and information gain are relatively similar, with the gain ratio giving the best performance for each fold size. This suggests that all three measures are robust within the context of this dataset. A likely contributing factor to this performance consistency is the balanced distribution of class labels in the dataset. When classes are balanced, each splitting criterion is more or less equally likely to encounter informative splits, which helps to decrease performance variability across different splitting methods.

It can also be noticed that all nine trees selected the attribute “Experience Level” as the first splitting attribute, indicating that it is the strongest predictor in reducing uncertainty among all predictors and the most informative in our case.

So, this is the model that we have chosen as our classification model (gain ratio with 10 folds):

plot(gainRatio10$finalModel)

As for clustering with the k-means method, the analysis showed that k=2 gave the best performance, standing out for its distinct clusters without overlap. The data within each cluster exhibited significant similarity while being notably dissimilar from the other cluster, affirming k=2 as the optimal choice for this clustering scenario based on evaluation methods such as BCubed precision and recall and average silhouette width, and on the graphs.

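This table displays the results of clustering using various methods for each K.

+----------------------------------------+-------------------+----------------+-----------------------------+
|                                        | K=2               | K=3            | K=4                         |
+:======================================:+:=================:+:==============:+:===========================:+
| **Average Silhouette width**           | 0.34              | 0.31           | 0.2                         |
+----------------------------------------+-------------------+----------------+-----------------------------+
| **total within-cluster sum of square** | 7295.548          | 6778.046       | 5915.724                    |
+----------------------------------------+-------------------+----------------+-----------------------------+
| **BCubed precision**                   | 0.2812713         | 0.2807635      | 0.2895803                   |
+----------------------------------------+-------------------+----------------+-----------------------------+
| **BCubed recall**                      | 0.7064208         | 0.6786707      | 0.401187                    |
+----------------------------------------+-------------------+----------------+-----------------------------+
| **Visualization**                      | ![](images/k2-01) | ![](images/k3) | ![](images/k4){width="145"} |
+----------------------------------------+-------------------+----------------+-----------------------------+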

Based on these metrics, we can assess the performance of each K value:

Considering these metrics, we can conclude that K = 2 is the optimal K: it performs comparatively better in terms of average silhouette width and BCubed recall, while BCubed precision is nearly the same for all three values of K. This indicates that clustering with K = 2 leads to well-separated clusters without overlap and with relatively good recall. It means that the data in a cluster are close (“similar”) to each other and dissimilar to the data in the other cluster, reflecting high intra-cluster similarity. The overlap and the low average silhouette width and recall for K=3 and K=4 could be due to the presence of outliers, sensitivity to the initial centroids, or high inter-cluster similarity.

We can notice that the results we obtained from both classification and clustering are interesting and strongly related to the core of the problem we are working on, and they bear directly on the solution.

Ultimately, both models prove valuable in predicting cybersecurity salaries and grouping the employees based on shared characteristics. The choice between classification and clustering hinges on specific objectives. Classification is advantageous for predicting salary ranges or categories based on known employee attributes, aiding individual compensation decisions. When it is essential to predict salary categories, employing the gain ratio in classification proves beneficial. Clustering aids in identifying natural employee groupings, revealing trends and common characteristics, facilitating market segmentation, and informing broader strategic decisions. For the goal of uncovering natural groupings in salary data, K=2 clustering stands out for its distinct clusters.

In conclusion, both techniques are important and suitable for our dataset, due to their ability to achieve our data mining tasks (predicting employees’ salaries and grouping them based on similarities). Thus both are critical to solving the problem introduced in this project, and both will help achieve our goals, such as market segmentation, fairness among employees, and increased loyalty.

Consequently, our solution is composed of two main parts that will help achieve our goals:

1.  Use classification to predict employees’ salaries (using the 10-fold gain ratio method)

2.  Use clustering to group employees based on their similarities (using k-means with 2 clusters)

By addressing issues such as unfairness, losing candidates to other companies that offer better privileges, and poor understanding of employees’ needs, we can solve them. Finally, by solving these problems, we can ensure that cybersecurity employees are satisfied with the salary they get, leading to better performance at their jobs and better protection of the organization’s data and valuable digital assets.


---
title: "Cybersecurity salaries"
output: html_notebook
---

```{r}

knitr::opts_chunk$set(warning=FALSE)

```

### Needed libraries

```{r}
library(dplyr)
library(countrycode)
library(outliers)
library(caret)
library(cluster)
library(factoextra)
library(NbClust)
library("DMwR")
library("RWeka")
library("C50")
library("rpart")
library("themis")
library(rattle)
library(rpart.plot)
library(RColorBrewer)
```

# phase 1

### Problem statement

Prediction of cyber security employees' salaries based on 11 attributes & grouping employees based on shared characteristics.

1.  work_year

2.  experience_level

3.  employment_type

4.  job_title

5.  salary

6.  salary_currency

7.  salary_in_usd

8.  employee_residence

9.  remote_ratio

10. company_location

11. company_size

### Problem description

We are living in the "information age", or rather the "data age", in which everything around us revolves around data. Data has become one of the most valuable assets that a person or an organisation can have, and because of that value, losing it leads to significant damage. Thus, most attacks nowadays target data. To guard against such damage, organisations have realised the importance of protecting their digital assets, leading them to hire cybersecurity specialists. This has made cybersecurity popular, and there is a growing tendency to study it. Consequently, plenty of professionals with various experience levels and skills have emerged in this field. As a result, organisations may find it difficult to decide on a salary for job candidates based solely on the CV. Also, since attacks evolve rapidly, organisations will need to hire more employees in the future to defend against them, but predicting the future payroll is not easy, which may hinder some of the organisation's plans. Another issue arises when the decision makers in the organisation aren't fully aware of the different groups of employees and their different needs. This lack of awareness gives competitor organisations a chance to attract their employees by offering better salaries and privileges that match their needs.

### Data mining task

Prediction of cybersecurity employees' salary categories (Very Low, Low, Medium, High, Very High) using classification, and description of the data's characteristics and behavior by grouping the data using clustering methods.

### Goal

Given the problems discussed above, and in order to better understand this field, we decided to analyse a dataset of 1247 cybersecurity employees containing information such as salary, job title, and experience level. Analysing this dataset can provide insightful predictions of the salary range of a cybersecurity employee and a description of the cybersecurity market's behavior through grouping the data, which can help in:

-   Market segmentation
-   Identifying trends
-   Specifying common characteristics among cybersecurity employees
-   Identifying the main cybersecurity employee groups to better understand their needs
-   Making better decisions
-   Making the recruitment and hiring process easier and more efficient
-   Predicting the future payroll
-   Increasing loyalty
-   Increasing the satisfaction rate
-   Achieving fairness

## Data

## Source of data:

<https://www.kaggle.com/datasets/deepcontractor/cyber-security-salaries>

### Reading and viewing dataset

```{r}
dataset= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/salaries_cyber.csv"), header=TRUE)
View(dataset)

```

### Original dataset

We will keep a copy of the original dataset before data preprocessing, to use if needed at any time.

```{r}
originalDataset= dataset
```

## General information about the dataset:

No. of attributes: 11\
Type of attributes: Ordinal , Nominal, and Numeric\
No. of objects: 1247\
Class label: salary_in_usd

```{r}
ncol(dataset)
nrow(dataset)
names(dataset)
str(dataset)
```

### Attributes' description table

+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| **Attribute Name** | **Description**                                             | **Data Type** | **Possible values**                                       |
+====================+=============================================================+===============+===========================================================+
| work_year          | The year in which salary was recorded                       | Numerical     | 2020 to 2022                                              |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| experience_level   | Expertise level of the employee                             | Ordinal       | EN "Entry level"\                                         |
|                    |                                                             |               | MI "Mid level"\                                           |
|                    |                                                             |               | SE "Senior level"\                                        |
|                    |                                                             |               | EX "Executive level"                                      |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| employment_type    | The nature or category of employee's engagement in the job  | Nominal       | PT "Part time"\                                           |
|                    |                                                             |               | FT "Full time"\                                           |
|                    |                                                             |               | CT "Contract"\                                            |
|                    |                                                             |               | FL "Freelancer"                                           |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| job_title          | The role worked in during the year                          | Nominal       | Different titles.                                         |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like Security Analyst, security researcher                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| salary             | The total gross salary amount paid                          | Numerical     | 1740-50001566                                             |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| salary_currency    | The currency of the salary paid to the employee             | Nominal       | Different currencies according to ISO 4217 currency code. |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like USD,EUR                                              |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| salary_in_usd      | The salary paid in United states dollar                     | Numerical     | 2000 to 365596.40                                         |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| employee_residence | Employee's primary country of residence                     | Nominal       | Different countries.                                      |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like US,AE                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| remote_ratio       | Percentage of online work by employee in the specified year | Numerical     | 0 "No remote work"\                                       |
|                    |                                                             |               | 50 "Partially remote"\                                    |
|                    |                                                             |               | 100 "Fully remote"                                        |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| company_location   | The country of the employer's main office                   | Nominal       | Different countries.                                      |
|                    |                                                             |               |                                                           |
|                    |                                                             |               | like BR,BW                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+
| company_size       | How big/small is the company                                | Ordinal       | S , M or L                                                |
+--------------------+-------------------------------------------------------------+---------------+-----------------------------------------------------------+

# phase 2

### sample of 20 employees from the dataset:

Using the `sample_n(tbl, size)` function, after fixing the random seed with `set.seed()`.

```{r}
set.seed(30)
sample=sample_n(dataset,20)
print(sample)
```

### Show the missing value:

`is.na()` returns FALSE for a value that is present and TRUE for a missing (NA) value. In our dataset there are no missing values.

```{r}
is.na(dataset)
sum(is.na(dataset))
```
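Printing the full logical matrix is hard to read for more than a handful of rows. A more compact check, sketched here on a hypothetical toy data frame (the `toy` values are made up for illustration), summarizes the missing values per column and per row:

```r
# Toy data frame with two missing entries (hypothetical values)
toy <- data.frame(
  salary_in_usd = c(90000, NA, 120000, 75000),
  remote_ratio  = c(100, 50, NA, 0),
  company_size  = c("S", "M", "L", "M")
)

colSums(is.na(toy))            # number of missing values per column
sum(is.na(toy))                # total number of missing values
anyNA(toy)                     # TRUE if at least one value is missing
which(!complete.cases(toy))    # indices of rows containing a missing value
```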

### Show the Min.,1st Qu.,Median,Mean ,3rd Qu.,Max. for each numeric column

The summary statistics for the dataset variables provide insights into the distribution of the features. We can conclude the following:

In work_year: The data spans the years 2020 to 2022, with most data falling within 2021 and 2022, as indicated by both the median and the mean being centered around 2021.

In salary: Salaries vary widely, with a minimum of \$1,740 and a maximum of \$500 million. The median is \$120,000, but the mean is notably higher at \$560,852, which might be due to extreme values or notable skewness.

In salary_in_usd: The data has a median of \$110,000 and a mean of \$120,278; the mean exceeding the median suggests a right-skewed spread of salaries.

In remote_ratio: The percentage of remote work ranges from 0% to 100%, with both the median and the 3rd quartile at 100% and a mean of 71.49%, indicating that remote work is common in the dataset, with some variability.

```{r}
summary(dataset$work_year)
summary(dataset$salary)
summary(dataset$salary_in_usd)
summary(dataset$remote_ratio)
```

### Show the variance of each numeric column

Variance measures the spread or dispersion of the values in each column. A higher variance indicates that the values are more spread out from the mean; in our dataset the attribute with the highest variance is salary. A lower variance indicates that the values are closer to the mean; in our dataset this is the work_year attribute.

The variance results reveal that:

-   work years are fairly consistent
-   salaries show notable variability and possible outliers
-   salaries in USD have a more stable distribution
-   the remote work ratio has moderate variability

```{r}
var(dataset$work_year)
var(dataset$salary)
var(dataset$salary_in_usd)
var(dataset$remote_ratio)
```

### Visualization of relationship between some pairs of attributes:

Here we used a boxplot to see the distribution of salary_in_usd for each experience_level. We observed that salaries vary depending on the level of experience, and that the two are positively correlated.

```{r}
boxplot(salary_in_usd ~ experience_level, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
```

Here we used a boxplot to see the distribution of salary_in_usd for each work_year. We observed that salaries in 2021 were close to each other, but in 2022 the gap between them grew.

```{r}
boxplot(salary_in_usd ~ work_year, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
```

Here we used a boxplot to see the distribution of salary_in_usd for each employment_type. We observed that full-time (FT) positions pay higher salaries than the other categories.

```{r}
boxplot(salary_in_usd ~ employment_type, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999)
```

Here we used a boxplot to see the distribution of salary_in_usd for each company_size. We observed that the larger the company, the higher the salary.

```{r}
boxplot(salary_in_usd ~ company_size, data = dataset , yaxt="n")
labels<- pretty(dataset$salary_in_usd)
labels<- sapply(labels, function(x) format(x, scientific = FALSE))
axis(side = 2, at=pretty(dataset$salary_in_usd), labels = labels )
options(scipen = 999) 
```

## Data preprocessing

## Data Reduction

### Dimensionality Reduction

The "salary" column gives the same information as "salary_in_usd"; the only difference is the currency. We would eventually have to transform all the values in the "salary" column to one common currency in order to deal with them properly. To further confirm that the two columns are redundant, we will convert between them using recent exchange rates between USD and each currency.

We start by creating a temporary column named "converted_salary". It stores the result of converting salary_in_usd back into the original currency with the exchange rate, so that it can be compared with the "salary" column.

```{r}
convertedDataset=dataset


convertedDataset$exchange_rate = factor(convertedDataset$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1/1,1/0.20,1/1.22,1/1.06,1/0.012,1/0.74,1/1.10,1/0.14,1/0.73,1/0.64,1/0.090,1/0.057,1/0.26,1/0.23,1/0.093,1/0.000065,1/0.60,1/0.0027,1/0.053,1/0.031,1/0.010))
convertedDataset$exchange_rate = as.numeric(as.character(convertedDataset$exchange_rate))
convertedDataset$converted_salary = convertedDataset$salary_in_usd*convertedDataset$exchange_rate



set.seed(1)
salary_sample <- sample_n(convertedDataset[,c("salary","converted_salary")],10)

print(salary_sample)
```

As shown in the sample, the two columns are almost identical. This is supported by the correlation coefficient as well.

```{r}
correlation <- cor(convertedDataset$salary , convertedDataset$converted_salary)
print(correlation)
```

The correlation is very high, but it does not reach exactly 1, possibly due to rounding in the calculations and slight differences in the exchange rates over time.

To make the mining process more efficient and improve its quality, we decided to remove the "salary" column.

```{r}
dataset<-dataset[,-c(5)]
```
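The currency conversion used earlier in this section attaches a rate to each currency code by relabelling a factor. An alternative that may be easier to read and extend is a named-vector lookup. The sketch below uses a subset of the same rates (USD value of one unit of each currency); the `toy` rows are hypothetical.

```r
# USD value of one unit of each currency (subset of the rates used above)
usd_per_unit <- c(USD = 1, EUR = 1.06, GBP = 1.22, INR = 0.012)

# Hypothetical rows for illustration
toy <- data.frame(
  salary_currency = c("USD", "EUR", "GBP", "INR"),
  salary_in_usd   = c(100000, 53000, 61000, 12000)
)

# Look up the rate by currency code, then divide to convert from USD.
# as.character() guards against the column being stored as a factor.
rate <- unname(usd_per_unit[as.character(toy$salary_currency)])
toy$converted_salary <- toy$salary_in_usd / rate
print(toy)
```

A currency code that is missing from the lookup vector yields `NA`, which makes unmapped currencies easy to spot.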

### Find the outliers and remove them:

We will show outliers with boxplots and then remove them, to minimize noise and get better analytical results when applying data mining techniques.

First we show the outliers of the salary_in_usd attribute. We can see that there are many outliers with exceptionally high values, thus we will remove them.

```{r}
boxplot(dataset$salary_in_usd)



OutSalary = outlier(dataset$salary_in_usd, logical =TRUE)
Find_outlier = which(OutSalary ==TRUE, arr.ind = TRUE)
dataset= dataset[-Find_outlier,]

```
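As far as we understand the `outliers` package, `outlier()` reports only the single value that lies farthest from the mean, so the chunk above may leave other extreme salaries in place. A minimal sketch of an alternative, assuming the usual 1.5 * IQR boxplot rule is acceptable, uses base R's `boxplot.stats()`; the `salaries` vector is hypothetical.

```r
# Toy salaries with two unusually high values (hypothetical numbers)
salaries <- c(52000, 61000, 58000, 75000, 80000, 67000,
              71000, 64000, 69000, 73000, 300000, 410000)

# boxplot.stats() flags values more than 1.5 * IQR beyond the hinges,
# i.e. the same points that a boxplot draws individually
flagged <- boxplot.stats(salaries)$out
keep    <- !(salaries %in% flagged)

flagged
salaries[keep]
```

Applied to a data frame, the same logical vector could be used for row filtering, for example `df[keep, ]`.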

Now we show the outliers of the remote_ratio attribute. We can see that there are no outliers in remote_ratio, thus we don't need the last step, i.e. removing the outliers' rows.

```{r}
boxplot(dataset$remote_ratio)

```

Now we show the outliers of the work_year attribute. We can see that there are no outliers in work_year, thus we don't need the last step, i.e. removing the outliers' rows.

```{r}
boxplot(dataset$work_year)

```

### Concept hierarchy generation for nominal data

The columns "company_location" and "employee_residence" hold the country of the company and of the employee respectively. These attributes can be generalized to a higher-level concept, the region, to help understand and analyze the dataset better and to improve algorithm performance.

We will use the 7 regions as defined in the World Bank Development Indicators. These regions are:

1.  East Asia and Pacific: This region includes countries like China, Australia, Indonesia, Thailand, etc.

2.  Europe and Central Asia: This region includes countries like Germany, UK, Russia, Turkey, etc.

3.  Latin America & Caribbean: This region includes countries like Brazil, Mexico, Argentina, Cuba, etc.

4.  Middle East and North Africa: This region includes countries like Saudi Arabia, Egypt, Iran, Iraq, etc.

5.  North America: This is predominantly United States and Canada.

6.  South Asia: This region includes countries like India, Pakistan, Bangladesh, Sri Lanka, etc.

7.  Sub-Saharan Africa: This region includes countries like Nigeria, South Africa, Ethiopia, Kenya, etc.

Note: UM (the United States Minor Outlying Islands) and AQ (Antarctica) don't belong to any of these regions; thus, they will be used as they are.

```{r}


um=which(dataset$company_location=="UM")
aq=which(dataset$company_location=="AQ")


dataset$company_location <- countrycode(dataset$company_location, "iso2c", "region")
dataset$employee_residence <- countrycode(dataset$employee_residence, "iso2c", "region")

dataset[um,"company_location"]="UM"
dataset[aq,"company_location"]="AQ"

```

Concept hierarchy generation can be done for "job_title" as well to improve interpretation and scalability. Many job titles are essentially the same job under different names, so we can combine them into higher-level job titles such as Architect, Analyst, and Engineer.

```{r}
## Create the categories based on job rank 
dataset$job_title <- ifelse(grepl("Analyst", dataset$job_title), "Analyst",
                                ifelse(grepl("Architect", dataset$job_title), "Architect",
                                       ifelse(grepl("Engineer", dataset$job_title), "Engineer",
                                              ifelse(grepl("Manager|Officer|Director|Leader", dataset$job_title), "Leadership",
                                                     ifelse(grepl("Consultant|Specialist", dataset$job_title), "Consultant/Specialist",
                                                            ifelse(grepl("Cyber", dataset$job_title), "Cyber Security",
                                                                   "Others"))))))

```
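The nested `ifelse()` calls above apply the patterns in order, so the first match wins. The same logic can be written as an ordered rule table, which may be easier to extend. This is a sketch only; `categorize_title()` and the sample titles are hypothetical.

```r
# Ordered pattern -> category rules; the first matching pattern wins
rules <- c(
  "Analyst"                         = "Analyst",
  "Architect"                       = "Architect",
  "Engineer"                        = "Engineer",
  "Manager|Officer|Director|Leader" = "Leadership",
  "Consultant|Specialist"           = "Consultant/Specialist",
  "Cyber"                           = "Cyber Security"
)

categorize_title <- function(titles, rules, default = "Others") {
  result     <- rep(default, length(titles))
  unassigned <- rep(TRUE, length(titles))
  for (pattern in names(rules)) {
    hit <- unassigned & grepl(pattern, titles)
    result[hit] <- rules[[pattern]]
    unassigned  <- unassigned & !hit
  }
  result
}

# Hypothetical titles for illustration
titles <- c("Security Analyst", "Cloud Security Architect",
            "Chief Information Security Officer", "Penetration Tester",
            "Cyber Security Engineer")
categorize_title(titles, rules)
```

Note that "Cyber Security Engineer" is mapped to "Engineer" because that rule comes before the "Cyber" rule, mirroring the order of the nested `ifelse()` calls.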

## Encoding categorical data

To deal with columns of character type we encode them as factors, because most machine learning algorithms are designed to work with factor data rather than character data, and encoding also improves the performance and interpretability of the data.

```{r}
# Nominal attributes: plain factors
dataset$job_title  <- factor(dataset$job_title)

dataset$employment_type  <- factor(dataset$employment_type)

dataset$employee_residence  <- factor(dataset$employee_residence)

dataset$company_location  <- factor(dataset$company_location)

dataset$salary_currency  <- factor(dataset$salary_currency)

# Ordinal attributes: levels listed from lowest to highest and relabelled 1, 2, 3, ...
dataset$experience_level = factor(dataset$experience_level, levels=c("EN", "MI", "SE", "EX"), labels=c(1,2,3,4))

dataset$company_size = factor(dataset$company_size, levels=c("S","M","L"), labels=c(1,2,3))

```
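Relabelling the levels as 1 to 4 does not by itself tell R that the attribute is ordinal. If the order should be available to later steps, an ordered factor is one option. The following sketch uses hypothetical values and is not part of the project's pipeline.

```r
# Hypothetical values for illustration
level <- c("SE", "EN", "EX", "MI", "SE")

# Ordered factor: keeps readable labels and records EN < MI < SE < EX
level_ord <- factor(level, levels = c("EN", "MI", "SE", "EX"), ordered = TRUE)

level_ord < "SE"         # comparisons respect the declared order
as.integer(level_ord)    # integer codes 1..4 when a numeric encoding is needed
table(level_ord)         # counts are reported in level order
```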

### Discretization of salary_in_usd attribute

By calculating breaks at the 25th, 50th, 75th, and 95th percentiles

```{r}
breaks <- quantile(dataset$salary_in_usd, 
                   probs = c(0, .25, .5, .75, .95, 1), 
                   na.rm = TRUE)


dataset$salary_in_usd <- cut(dataset$salary_in_usd, 
                                       breaks = breaks, 
                                       include.lowest = TRUE, 
                                       labels=c("Very Low", "Low", "Medium", "High", "Very High"))


```
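To see what these breaks do, the sketch below applies the same `quantile()` and `cut()` calls to synthetic, right-skewed salaries. The `toy_salary` vector is generated for illustration and is not the project's data.

```r
# Synthetic salaries (not the project's data), right-skewed like real pay data
set.seed(7)
toy_salary <- rlnorm(200, meanlog = 11.5, sdlog = 0.5)

toy_breaks <- quantile(toy_salary, probs = c(0, .25, .5, .75, .95, 1))
toy_class  <- cut(toy_salary, breaks = toy_breaks, include.lowest = TRUE,
                  labels = c("Very Low", "Low", "Medium", "High", "Very High"))

# Share of rows per class: roughly 25%, 25%, 25%, 20% and 5%
round(prop.table(table(toy_class)), 2)
```

With these probabilities the first three classes are about equally populated, while "Very High" is deliberately a small class holding the top 5% of salaries.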

### Normalization:

To standardize the numeric attributes (remote_ratio and work_year) to zero mean and unit standard deviation (z-scores), so that they carry equal weight. Note that `scale()` does not bound the values to [-1,1].

```{r}
dataset [, c("work_year" , "remote_ratio")] = scale(dataset [, c("work_year" , "remote_ratio")])
```
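To make the difference between z-score standardization (what `scale()` computes by default) and min-max rescaling to [-1, 1] concrete, here is a small sketch on hypothetical values:

```r
remote_ratio <- c(0, 50, 100, 100, 100)   # hypothetical values

# z-score standardization: subtract the mean, divide by the standard deviation
z <- as.numeric(scale(remote_ratio))

# min-max rescaling to the interval [-1, 1]
min_max <- 2 * (remote_ratio - min(remote_ratio)) /
  (max(remote_ratio) - min(remote_ratio)) - 1

round(z, 3)
min_max
```

The z-scores have mean 0 and standard deviation 1 but can fall outside [-1, 1], whereas the min-max values are bounded by construction.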

## Feature Selection

We will implement feature selection to remove redundant or irrelevant attributes from the dataset, to get the smallest subset that helps us get the most accurate predictions for our target class (salary_in_usd) and to decrease the time the classifier takes to process the data.

We will use RFE (recursive feature elimination), which is a wrapper method for feature selection. Since the RFE function has multiple control options, we need to specify the ones we want. We will choose the random forest functions ("rfFuncs") because random forests tend to have high accuracy and can handle categorical data.

```{r}
control <- rfeControl(functions = rfFuncs, 
                      method = "repeatedcv",
                      repeats = 5, 
                      number = 10)
```

First we save the features used in the selection (every attribute except the class label "salary_in_usd") in the variable x, and the class label in the variable y. Then we split the data into 80% training and 20% test.

```{r}
x <- dataset %>%
  select(-salary_in_usd) %>%
  as.data.frame()

# Target variable
y <- dataset$salary_in_usd

# Training: 80%; Test: 20%
set.seed(2021)
inTrain <- createDataPartition(y, p = .80, list = FALSE)[,1]

x_train <- x[ inTrain, ]
x_test  <- x[-inTrain, ]

y_train <- y[ inTrain]
y_test  <- y[-inTrain]

```

After splitting the data, we can perform the selection using rfe.

```{r}
set.seed(1)
result_rfe1 <- rfe(x = x_train, 
                   y = y_train, 
                   sizes = c(1:9),
                   rfeControl = control)

result_rfe1

predictors(result_rfe1)

```

The results show that all the remaining attributes, except for "employment_type", are selected. This is logical, as 98% of the rows have the value "FT", as shown in the table below. Due to the low variance, we decided to remove this attribute.

```{r}
table(dataset$employment_type)
```

```{r}
dataset<-dataset[,-which(names(dataset)=="employment_type")]
```

# phase 3

During this phase, our focus will be on clustering and classification techniques to analyze the data. The primary objectives are to identify distinct groups within the dataset through clustering, classify data objects into meaningful categories, and apply different evaluation methods to assess the accuracy and precision of both classification and clustering results. We aim to gain deeper insights into the data and discover patterns.

## Retrieve our preprocessed dataset

```{r}

# Read the CSV file from github
dataset2= read.csv(url("https://raw.githubusercontent.com/SarahAlhindi/DM_project/main/Data%20Set/preprocessedDataset.csv"), header=TRUE)

# Identify the character variables in the dataset2
char_vars <- sapply(dataset2, is.character)

# Convert the identified character variables in dataset2 to factors
dataset2[char_vars] <- lapply(dataset2[char_vars], as.factor)

```

## balancing data

To resolve the class imbalance in the dataset, we use the SMOTE() method, which oversamples the minority class by creating synthetic samples from the existing minority-class samples.

```{r}
set.seed(10)
balanced_dataset <- SMOTE(salary_in_usd ~ ., dataset2, perc.over = 300, perc.under=500, k = 10)
```
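
The core idea of SMOTE for numeric attributes can be sketched in a few lines: a synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The data and the helper `smote_point()` below are hypothetical and only illustrate this interpolation step; this is not the implementation behind `SMOTE()`, which does more (for example, it also handles nominal attributes).

```r
set.seed(1)

# Hypothetical minority-class samples with two numeric attributes
minority <- matrix(c(1.0, 2.0,
                     1.5, 1.8,
                     3.0, 3.5,
                     0.8, 2.2),
                   ncol = 2, byrow = TRUE)

# Create one synthetic sample from row i using its k nearest minority neighbours
smote_point <- function(data, i, k = 2) {
  d <- as.matrix(dist(data))[i, ]     # distances from row i to every row
  neighbours <- order(d)[2:(k + 1)]   # skip position 1, which is row i itself
  j <- neighbours[sample.int(k, 1)]   # pick one neighbour at random
  gap <- runif(1)                     # random position along the segment
  data[i, ] + gap * (data[j, ] - data[i, ])
}

synthetic <- smote_point(minority, i = 1)
synthetic
```

Because the synthetic point is a convex combination of two existing samples, it always stays inside the region already covered by the minority class.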


## Data mining techniques


The goal of all preceding steps is to prepare the dataset for classification and clustering, our primary mining objectives. In this section, we build decision tree models using three attribute selection measures: the Gini index, gain ratio, and information gain. We then evaluate each model; if it proves effective, it can be used to classify new instances with unknown class labels. The prediction process is as follows: divide the data into training and test sets, train the model on the training set, and test its performance on the test set.

Since our dataset is small, we use k-fold cross-validation as the partitioning method. For each attribute selection measure we try three values of k: 10, 5, and 3.
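
To make the partitioning concrete, here is a minimal base-R sketch of how k-fold cross-validation assigns rows to folds. The sizes `n` and `k` are hypothetical; in this project the folds are created internally by caret, whose fold creation also takes the class distribution into account.

```r
set.seed(10)

n <- 20  # hypothetical number of rows
k <- 5   # number of folds

# Assign each row to one of k equally sized folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  test_rows  <- which(fold_id == fold)  # held-out fold
  train_rows <- which(fold_id != fold)  # remaining k - 1 folds
  cat("Fold", fold, ": train on", length(train_rows),
      "rows, test on", length(test_rows), "rows\n")
}
```

Every row is used for testing exactly once and for training k - 1 times, which is why the method suits small datasets.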

Throughout this section we use the train and trainControl functions of the caret package to produce decision trees. For the Gini index the method is "rpart" from the "rpart" package, for gain ratio it is "J48" from the "RWeka" package, and for information gain it is "C5.0" from the "C50" package.



Data clustering partitions data into groups, or clusters. It is an unsupervised learning process, which is executed without knowing the class labels of the training data. Objects in the same cluster are similar to one another and different from objects in other clusters. For this data mining task we use k-means clustering.
We use the function "fviz_nbclust" of the package "factoextra" to find the number of clusters based on the elbow method and the silhouette coefficient. To run k-means we use the function "kmeans" of the package "stats". To visualize the clusters, we use the function "fviz_cluster" from the package "factoextra". Finally, to find the average silhouette width of each cluster, we use the function "silhouette" from the package "cluster".


## Evaluation and Comparison



### Classification



The following function computes the macro-averaged sensitivity, specificity, and precision over all classes, together with the overall accuracy:

```{r}


macro = function(matrix){

  # Macro-average: mean of each per-class metric over all classes.
  # "Pos Pred Value" is the per-class precision reported by caret.
  avgs = data.frame(
    Sensitivity = mean(matrix$byClass[, "Sensitivity"]),
    Specificity = mean(matrix$byClass[, "Specificity"]),
    Precision   = mean(matrix$byClass[, "Pos Pred Value"]),
    Accuracy    = unname(matrix$overall["Accuracy"])
  )
  print(avgs)

}


```

#### Gini index

The Gini index measures the impurity of a dataset. The attribute whose partitioning yields the largest reduction in impurity is selected as the splitting attribute. To apply the Gini index, we use the "rpart" method, which uses the Gini index as its splitting criterion.
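
The impurity reduction can be worked through on a small example. The sketch below uses made-up labels (the vectors `salary` and `level` are hypothetical, not rows of our dataset) and a hypothetical helper `gini()` that computes 1 minus the sum of squared class proportions.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_i^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Hypothetical toy data
salary <- c("High", "High", "High", "Low", "Low", "Low", "Low", "High")
level  <- c("SE",   "SE",   "SE",   "EN",  "EN",  "EN",  "SE",  "EN")

# Weighted impurity after splitting on `level`
parts <- split(salary, level)
gini_split <- sum(sapply(parts, function(s) length(s) / length(salary) * gini(s)))

gini(salary)               # 0.5 before the split
gini_split                 # 0.375 after the split
gini(salary) - gini_split  # 0.125 reduction in impurity
```

Among several candidate attributes, the one with the largest reduction would be chosen for the split.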

##### 10 Folds

The tree of the gini index using 10 folds

```{r}

set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")

tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))

giniIndex10 <- train(
  salary_in_usd ~ .,
  data = balanced_dataset,
  method = "rpart",
  trControl = ctrl,tuneGrid=tuneGrid,
  control = list(
    minsplit = 10,
    minbucket = 5,
    xval = 10,
    cp = 0.0001
  )

)


prp(giniIndex10$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)

```

The "experience level" attribute was selected as the first splitting attribute, meaning that it gives the largest reduction in impurity.

###### Confusion matrix of 10 folds using Gini Index

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.

```{r}

# confusionMatrix() expects the predictions first and the true labels second
giniIndex10cm = caret::confusionMatrix(data = giniIndex10$pred$pred, reference = giniIndex10$pred$obs)

giniIndex10cm

```

The metrics shown for each class are computed by treating that class as the positive class and all other classes together as the negative class. Here the classifier performs best when "Very High" is the positive class, but this result on its own says little, since all classes should be taken into consideration.
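
The one-vs-rest computation behind these per-class metrics can be sketched by hand. The 3-class confusion matrix `cm` and the helper `one_vs_rest()` below are hypothetical and serve only as an illustration; the real values come from `confusionMatrix()`.

```r
# Hypothetical 3-class confusion matrix: rows = predicted, columns = actual
cm <- matrix(c(8, 1, 1,
               2, 7, 1,
               0, 2, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("Low", "Medium", "High"),
                             actual    = c("Low", "Medium", "High")))

one_vs_rest <- function(cm, positive) {
  tp <- cm[positive, positive]
  fp <- sum(cm[positive, ]) - tp  # predicted positive, actually another class
  fn <- sum(cm[, positive]) - tp  # actually positive, predicted another class
  tn <- sum(cm) - tp - fp - fn
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# One row of metrics per class, each class taking its turn as the positive class
t(sapply(colnames(cm), function(cl) one_vs_rest(cm, cl)))
```

Averaging each column of the resulting table gives the macro-averaged metrics.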

##### 5 Folds

The tree of the gini index using 5 folds

```{r}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")


tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))

giniIndex5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart", trControl = ctrl,tuneGrid=tuneGrid,
  control = list(
    minsplit = 10,
    minbucket = 5,
    xval = 10,
    cp = 0.0001
  ))

prp(giniIndex5$finalModel, box.palette = "Reds", tweak = 1.5, varlen = 10, cex = 0.15)


```

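This tree has the same structure as the previous tree that used 10 folds, so here as well "experience level" was chosen as the first splitting attribute. This is expected when the same cp value is selected, because caret refits the final model on the full dataset; the number of folds only influences which cp is chosen.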

###### Confusion matrix of 5 folds using Gini Index

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.

```{r}
# confusionMatrix() expects the predictions first and the true labels second
giniIndex5cm = caret::confusionMatrix(data = giniIndex5$pred$pred, reference = giniIndex5$pred$obs)

giniIndex5cm

```

The results are very close to those of the 10-fold tree, so here as well the classifier performs best on the "Very High" class.

##### 3 Folds

The tree of the gini index using 3 folds

```{r}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")


tuneGrid <- expand.grid(cp = c(0.001, 0.005, 0.01))

giniIndex3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "rpart", trControl = ctrl,tuneGrid=tuneGrid,
  control = list(
    minsplit = 10,
    minbucket = 5,
    xval = 10,
    cp = 0.0001
  ))

prp(giniIndex3$finalModel, box.palette = "Reds", tweak = 1.5, varlen = 10, cex = 0.15)


```

The tree has a structure similar to the two previous trees, both in its choice of splitting attributes and in its leaves.

###### Confusion matrix of 3 folds using Gini Index

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.

```{r}

# confusionMatrix() expects the predictions first and the true labels second
giniIndex3cm = caret::confusionMatrix(data = giniIndex3$pred$pred, reference = giniIndex3$pred$obs)

giniIndex3cm

```

Here as well, the "Very High" class has the best overall performance.

###### Analysis of the gini index classification

All three trees seem to be alike in their arrangement and form.

1.  Root Node - Experience Level: The initial attribute used for splitting the dataset at the root node is the "experience level." This divides the tree into two main branches or subtrees:
    -   Right Subtree: This comprises instances with Senior (SE) and Executive (EX) experience levels.
    -   Left Subtree: This includes individuals with Entry (EN) and Mid (MI) experience levels.
2.  Right Subtree - work year: The next attribute used to further classify the right subtree is "work year." The decision criterion is:
    -   If work year is \<-1.8: Then it is high.
    -   If work year is NOT \<-1.8: The next attribute examined is "experience level."
3.  Left Subtree - Experience Level: On the left side of the tree, the attribute "experience level" is used again to further split the instances:
    -   If experience level is \>=2: The next attribute examined is "experience level."
    -   If experience level is NOT \>=2: The next attribute examined is also "experience level."

###### Sensitivity, Accuracy, Specificity and Precision of all 3, 5 and 10 folds using Gini Index

```{r}
rbind("10 Folds"=macro(giniIndex10cm), "5 Folds"=macro(giniIndex5cm), "3 Folds"=macro(giniIndex3cm)  ) 
```

The 10-fold case has the highest sensitivity, specificity, precision, and accuracy, which indicates better overall performance on these metrics. The Gini index model therefore performs better with 10-fold cross-validation than with 5 or 3 folds.

#### Gain ratio

The gain ratio, a normalized measure of information gain, is calculated by dividing information gain by the split information. The attribute that yields the highest gain ratio is chosen as the splitting attribute. The C4.5 algorithm employs the gain ratio.

J48 is the open-source Java implementation of the C4.5 algorithm that is included in Weka; the RWeka package makes it available from R, so the C4.5 decision tree can be applied conveniently.
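
The gain ratio can be worked through on a small example. The sketch below uses made-up labels (the vectors `salary` and `level` are hypothetical) and a hypothetical helper `entropy()`; it illustrates the formula only and is not the code that J48 runs.

```r
# Entropy (in bits) of a vector of labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Hypothetical toy data
salary <- c("High", "High", "High", "Low", "Low", "Low", "Low", "High")
level  <- c("SE",   "SE",   "SE",   "EN",  "EN",  "EN",  "MI",  "MI")

# Information gain: entropy before the split minus weighted entropy after it
parts <- split(salary, level)
info_after <- sum(sapply(parts, function(s) length(s) / length(salary) * entropy(s)))
info_gain <- entropy(salary) - info_after

# Split information: entropy of the partition sizes themselves
split_info <- entropy(level)

gain_ratio <- info_gain / split_info
c(info_gain = info_gain, split_info = split_info, gain_ratio = gain_ratio)
```

Dividing by the split information penalizes attributes that break the data into many small partitions, which plain information gain tends to favour.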

##### 10 Folds

The tree of the gain ratio using 10 folds

```{r , fig.height=70, fig.width=90}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")
gainRatio10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio10$finalModel)
```

The first splitting attribute chosen is "experience level", meaning that it probably has a high information gain and a low split information (the entropy of the distribution of tuples across the partitions).

###### Confusion matrix of 10 folds using Gain ratio

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.

```{r}
# confusionMatrix() expects the predictions first and the true labels second
gainRatio10cm = caret::confusionMatrix(data = gainRatio10$pred$pred, reference = gainRatio10$pred$obs)

gainRatio10cm


```

Here the classifier performs best when the "Very High" or the "Very Low" class is treated as the positive class: "Very High" has the best sensitivity, while "Very Low" has the best specificity and precision (Pos Pred Value).

##### 5 Folds

The tree of the gain ratio using 5 folds

```{r , fig.height=70, fig.width=90}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")
gainRatio5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio5$finalModel)
```

The tree is similar to the one produced with 10 folds. It chose "experience level" as the first splitting attribute and seems to behave similarly.

###### Confusion matrix of 5 folds using Gain ratio

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.

```{r}

# confusionMatrix() expects the predictions first and the true labels second
gainRatio5cm=caret::confusionMatrix(data = gainRatio5$pred$pred, reference = gainRatio5$pred$obs)

gainRatio5cm

```

Unlike with 10 folds, here the classifier has the best overall performance when the "Very High" class is the positive class.

##### 3 Folds

The tree of the gain ratio using 3 folds

```{r, fig.height=70, fig.width=90}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")
gainRatio3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "J48",trControl = ctrl)
plot(gainRatio3$finalModel)
```

the tree shows similar behavior as the previous 2 trees that resulted from using 10 and 5 folds.

###### Confusion matrix of 3 folds using Gain ratio

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.

```{r}
# confusionMatrix() expects the predictions first and the true labels second
gainRatio3cm=caret::confusionMatrix(data = gainRatio3$pred$pred, reference = gainRatio3$pred$obs)

gainRatio3cm

```

Similar to the 5-fold case, the "Very High" class has the best metrics.

###### Analysis of the gain ratio classification

The observed structure of the three decision trees seems to be the same and it can be summarized as follows:

1.  Root Node - Experience Level: The initial attribute used for splitting the dataset at the root node is the "experience level." This divides the tree into two main branches or subtrees:
    -   Right Subtree: This comprises instances with Senior (SE) and Executive (EX) experience levels.
    -   Left Subtree: This includes individuals with Entry (EN) and Mid (MI) experience levels.
2.  Within the right subtree:
    -   If the 'experience level' is 4 (EX, Executive level) , the tree splits based on the 'Employee_residence' attribute. It checks whether the 'Employee_residence' is 'Latin America.'
    -   If 'Employee_residence' does not equal 'Latin America,' the differentiation continues with the 'remote_ratio' attribute, further dividing the tree.
3.  Within the left subtree:
    -   If the 'experience level' is 1 (EN, Entry level), the tree divides based on the 'Employee_residence' attribute, specifically checking for 'Sub-Saharan Africa.'
    -   If the 'experience level' is 2 (MI, Mid level), it also branches based on 'Employee_residence,' but in this case, looking to see if it equals 'North America.'

The decision tree continues to select the most appropriate attributes for splitting at each node, progressively refining the decision process until it reaches the leaves, where final class labels are assigned to the instances.

###### Sensitivity, Accuracy, Specificity and Precision of all 3, 5 and 10 folds using Gain ratio

```{r}
rbind("10 Folds"=macro(gainRatio10cm), "5 Folds"=macro(gainRatio5cm), "3 Folds"=macro(gainRatio3cm)  ) 
```

Based on the average sensitivity, precision, specificity, and accuracy, the gain ratio model built with 10-fold cross-validation performs better than the other two models. However, the difference in performance between the models is relatively small.

A closer look at the 10-fold results shows that the model's specificity is notably higher than its other metrics. High specificity means that the model is effective at recognizing instances that do not belong to the positive class. For example, if the positive class is "High", the model correctly recognizes that tuples belonging to "Very Low", "Low", "Medium", and "Very High" are not "High".

However, high specificity alone does not make the model effective overall. Its ability to capture instances that do belong to the positive class, as measured by sensitivity, is not as strong. To be considered truly effective, a model needs strong sensitivity and specificity, so that it distinguishes both positive and negative instances, as well as strong accuracy and precision.

#### Information gain

Information Gain is a metric used to decide which attribute to choose for splitting the data at each node in the decision tree. For a given dataset, the Information Gain of an attribute is calculated by comparing the entropy before and after the dataset is split based on that attribute. The attribute with the highest Information Gain is chosen as the splitting attribute.
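
For reference, the computation described above can be written as follows, using notation we introduce here for illustration: $D$ is the dataset at the node, $p_i$ is the proportion of tuples in $D$ that belong to class $i$ (out of $m$ classes), and attribute $A$ splits $D$ into partitions $D_1, \dots, D_v$.

$$
\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad
\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j), \qquad
\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)
$$

$\mathrm{Info}(D)$ is the entropy before the split and $\mathrm{Info}_A(D)$ is the weighted entropy after it, so $\mathrm{Gain}(A)$ is the reduction in entropy obtained by splitting on $A$.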

##### 10 Folds

The tree of the information gain using 10 folds

```{r, fig.height=70, fig.width=90}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 10, returnResamp="all", savePredictions="final")


infoGain10 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)

c5model <- C5.0(salary_in_usd ~ .,
                       data = balanced_dataset,
                       trials = infoGain10$bestTune$trials, 
                       rules = FALSE,
                       control = C5.0Control(winnow = infoGain10$bestTune$winnow))
plot(c5model)
```

In the tree, "experience level" was selected as the first splitting attribute, meaning that it has the highest information gain among all attributes.

###### Confusion matrix of 10 folds using Information gain

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.

```{r}
# confusionMatrix() expects the predictions first and the true labels second
infoGain10cm= caret::confusionMatrix(data = infoGain10$pred$pred, reference = infoGain10$pred$obs)

infoGain10cm

```

Similar to the Gini index and gain ratio trees, the classifier seems to perform better when the "Very High" class is treated as the positive class.

##### 5 Folds

The tree of the information gain using 5 folds

```{r, fig.height=70, fig.width=90}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 5, returnResamp="all", savePredictions="final")


infoGain5 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)

c5model <- C5.0(salary_in_usd ~ .,
                       data = balanced_dataset,
                       trials = infoGain5$bestTune$trials, 
                       rules = FALSE,
                       control = C5.0Control(winnow = infoGain5$bestTune$winnow))
plot(c5model)
```

the tree has similar behavior as the 10 folds information gain tree

###### Confusion matrix of 5 folds using Information gain

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.

```{r}
# confusionMatrix() expects the predictions first and the true labels second
infoGain5cm = caret::confusionMatrix(data = infoGain5$pred$pred, reference = infoGain5$pred$obs)

infoGain5cm

```

the classifier shows very close performance to the 10 folds information gain model

##### 3 Folds

The tree of the information gain using 3 folds

```{r, fig.height=70, fig.width=90}
set.seed(10)
ctrl <- trainControl(method = "cv", number = 3, returnResamp="all", savePredictions="final")


infoGain3 <- train(salary_in_usd ~ ., data = balanced_dataset, method = "C5.0",trControl = ctrl)

c5model <- C5.0(salary_in_usd ~ .,
                       data = balanced_dataset,
                       trials = infoGain3$bestTune$trials, 
                       rules = FALSE,
                       control = C5.0Control(winnow = infoGain3$bestTune$winnow))
plot(c5model)
```

The visible parts of the tree seem to behave the same as with the previous two fold sizes, 10 and 5.

###### Confusion matrix of 3 folds using Information gain

The following confusion matrix shows the performance of the classifier by comparing the predicted class labels with the actual class labels of our dataset.

```{r}
# confusionMatrix() expects the predictions first and the true labels second
infoGain3cm = caret::confusionMatrix(data = infoGain3$pred$pred, reference = infoGain3$pred$obs)

infoGain3cm

```

Since the tree is essentially the same as the previous two information gain trees, its results are also very close to theirs.

###### Analysis of the information gain classification

The observed structure of the three decision trees seems to be the same and it can be summarized as follows:

1.  Root Node - Experience Level: The initial attribute used for splitting the dataset at the root node is the "experience level." This divides the tree into two main branches or subtrees:

    -   Right Subtree: This comprises instances with Senior (SE) and Executive (EX) experience levels.
    -   Left Subtree: This includes individuals with Entry (EN) and Mid (MI) experience levels.

2.  Within the right subtree: If the experience level is 4 (EX), the tree splits on "company location".

3.  Within the left subtree: For both experience levels 1 (EN) and 2 (MI), the tree splits on "employee residence". When "employee residence" is "North America", the tree splits further on "salary currency", and when that attribute equals "USD", the next split is on "job title".

The decision tree continues to select the most appropriate attributes for splitting at each node, progressively refining the decision process until it reaches the leaves, where final class labels are assigned to the instances.

###### Sensitivity, Accuracy, Specificity and Precision of all 3, 5 and 10 folds using Information gain

```{r}
rbind("10 Folds"=macro(infoGain10cm), "5 Folds"=macro(infoGain5cm), "3 Folds"=macro(infoGain3cm)  ) 
```

Based on the sensitivity, specificity, precision, and accuracy values, no fold count is clearly superior for the information gain model. As the table shows, 10 folds gives the best specificity and precision, while 5 folds gives the best sensitivity and accuracy. Additional criteria or further analysis would be needed to make a well-informed choice.

### Clustering

As described above, clustering is an unsupervised process that groups similar objects together without using the class label. For this data mining task we use k-means clustering.

#### 1- Preprocessing

Before clustering, we encode the remaining factor columns as numeric values, because k-means and the other measures we use rely on distance calculations that are only meaningful for numeric data. We also remove the class label from the dataset, since clustering is an unsupervised learning process, and keep it in a separate variable for later use. Lastly, we scale all attributes in the dataset so that they are standardized.

```{r}

# preview the data (head() also works when the notebook is knitted non-interactively)

dataset3 <- dataset2
head(dataset3)

# Keep the salary_in_usd column (the class label) in a separate variable before removing it from the dataset for clustering

classLabel <- dataset3$salary_in_usd


# Remove the class label from the dataset

dataset3$salary_in_usd <- NULL

# encoding job_title variable

dataset3$job_title = factor(dataset3$job_title, levels=c("Analyst", "Architect", "Engineer", "Leadership", "Consultant/Specialist","Cyber Security","Others" ), labels=c(4,1,2,5,3,6,7))

# encoding salary_currency variable

dataset3$salary_currency = factor(dataset3$salary_currency, levels=c("USD","BRL","GBP","EUR","INR","CAD","CHF","DKK","SGD","AUD","SEK","MXN","ILS","PLN","NOK","IDR","NZD","HUF","ZAR","TWD","RUB"), labels=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21))

# encoding employee_residence variable

dataset3$employee_residence = factor(dataset3$employee_residence, levels=c("North America","Latin America & Caribbean","Sub-Saharan Africa", "Europe & Central Asia","East Asia & Pacific","South Asia","Middle East & North Africa"), labels=c(1,2,3,4,5,6,7))

# encoding company_location variable

dataset3$company_location = factor(dataset3$company_location, levels=c("North America","Latin America & Caribbean","Sub-Saharan Africa", "Europe & Central Asia","East Asia & Pacific","South Asia","Middle East & North Africa", "AQ", "UM"), labels=c(1,2,3,4,5,6,7,8,9))


 
#Data types to be transformed into numeric types before clustering
#Transforming all non-numeric attributes to numeric type


dataset3$experience_level <- as.numeric(as.character(dataset3$experience_level))

dataset3$job_title <- as.numeric(as.character(dataset3$job_title))

dataset3$salary_currency <- as.numeric(as.character(dataset3$salary_currency))

dataset3$employee_residence <- as.numeric(as.character(dataset3$employee_residence))

dataset3$company_location <- as.numeric(as.character(dataset3$company_location))

dataset3$company_size <- as.numeric(as.character(dataset3$company_size))

# view the class of each attribute to ensure they were all converted to numeric
sapply(dataset3, class)


#scale all attributes in the dataset so they would be standardized 
dataset3 <- scale(dataset3)

```

#### 2- K-means

After preprocessing the data, we are ready to perform the clustering. We use k-means, a clustering method that aims to minimize the sum of squared distances between each data point and the centroid of its assigned cluster by iteratively updating cluster assignments and centroids.
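
The two alternating steps can be sketched in base R. This is a simplified illustration on hypothetical 2-D points (`pts` is made-up data) with a fixed cap of 10 iterations; it is not the algorithm variant that `kmeans()` uses by default, and it does not handle empty clusters.

```r
set.seed(8953)

# Hypothetical 2-D data: two loose groups of points
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

k <- 2
centroids <- pts[sample(nrow(pts), k), , drop = FALSE]  # random initial centroids

for (iter in 1:10) {
  # Assignment step: index of the nearest centroid for every point
  d <- as.matrix(dist(rbind(centroids, pts)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)

  # Update step: each centroid becomes the mean of its assigned points
  new_centroids <- t(sapply(1:k, function(j) colMeans(pts[cluster == j, , drop = FALSE])))
  if (all(abs(new_centroids - centroids) < 1e-9)) break  # converged
  centroids <- new_centroids
}

# Total within-cluster sum of squares, the quantity k-means tries to minimize
wss <- sum((pts - centroids[cluster, ])^2)
wss
```

The result depends on the initial centroids, which is why the chunks below fix a seed before calling `kmeans()`.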

#### 3- Choosing number of clusters (k)

We will run k-means with three different values of k: one relatively large, one intermediate, and one small. This way we cover a range of possible clustering results.

##### a- Silhouette method

We now apply the silhouette method to find the optimal number of clusters k. The plot shows the number of clusters on the x-axis and the average silhouette coefficient on the y-axis.

```{r}

fviz_nbclust(dataset3, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette method")

```

As the graph shows, the number of clusters k that maximizes the average silhouette coefficient is 2, so we will use it for clustering.
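
For reference, the silhouette coefficient of a single object $i$ is defined as follows, where $a(i)$ is the mean distance from $i$ to the other objects in its own cluster and $b(i)$ is the smallest mean distance from $i$ to the objects of any other cluster:

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
$$

Values close to 1 indicate that the object sits well inside its cluster, values near 0 indicate that it lies between two clusters, and negative values suggest it may be assigned to the wrong cluster. The plot reports the average of $s(i)$ over all objects for each k.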

##### b- Elbow method

This method determines the number of clusters from the turning point (the elbow) of a curve that plots the total within-cluster sum of squares (WSS) on the y-axis against the number of clusters on the x-axis.

```{r}

fviz_nbclust(dataset3, kmeans, method = "wss") +
  geom_vline(xintercept = 4, linetype = 2)+
  labs(subtitle = "Elbow method")

```

As shown, the turning point of the curve is at k=4, so we will use it for clustering.

Lastly, we will use k=3, since it achieves the second-highest average silhouette coefficient and, lying between 2 and 4, strikes a balance between having too few clusters (k=2) and several clusters (k=4). We therefore expect it to give an acceptable clustering quality.

#### k-means clustering, visualization and evaluation

In this section, we perform k-means clustering and visualize the result for each of the three values of k chosen above. To evaluate each clustering, we compute the WSS, the BCubed precision and recall, and the average silhouette width of each cluster.
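
For reference, the BCubed measures computed in the chunks below can be written as follows, with notation we introduce here for illustration: $n$ is the number of objects, $C(i)$ is the cluster assigned to object $i$, and $L(i)$ is its class label (the salary category).

$$
\mathrm{Precision}_{B^3} = \frac{1}{n}\sum_{i=1}^{n} \frac{|\{j : C(j)=C(i) \wedge L(j)=L(i)\}|}{|\{j : C(j)=C(i)\}|},
\qquad
\mathrm{Recall}_{B^3} = \frac{1}{n}\sum_{i=1}^{n} \frac{|\{j : C(j)=C(i) \wedge L(j)=L(i)\}|}{|\{j : L(j)=L(i)\}|}
$$

Precision asks how pure each object's cluster is with respect to its label, and recall asks how much of each object's class ends up in its cluster.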

##### k=2

```{r}

#Use seed to guarantee replicability of random processes
set.seed(8953)

# run k-means clustering to find 2 clusters
kmeans.result <- kmeans(dataset3, 2)

# print the clustering result
kmeans.result

# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)


#average silhouette for each clusters
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset3)) 
fviz_silhouette(avg_sil)

#Within-cluster sum of squares wss 
wss <- kmeans.result$tot.withinss
print(wss)

#BCubed
kmeans_cluster <- c(kmeans.result$cluster)

ground_truth <- c(classLabel)

data <- data.frame(cluster = kmeans_cluster, label = ground_truth)


# BCubed precision and recall, averaged over all objects
bcubed <- function(data) {
  n <- nrow(data)
  total_precision <- 0
  total_recall <- 0

  for (i in seq_len(n)) {
    cluster <- data$cluster[i]
    label <- data$label[i]

    # Number of objects in the same category and cluster
    intersection <- sum(data$label[data$cluster == cluster] == label)

    # Number of objects that are in the same cluster
    total_same_cluster <- sum(data$cluster == cluster)

    # Number of objects that have the same category
    total_same_category <- sum(data$label == label)

    total_precision <- total_precision + intersection / total_same_cluster
    total_recall <- total_recall + intersection / total_same_category
  }

  # compute avg precision and recall
  precision <- total_precision / n
  recall <- total_recall / n

  return(list(precision = precision, recall = recall))
}


# compute BCubed precision and recall
metrics <- bcubed(data)


precision <- metrics$precision
recall <- metrics$recall

# Print results
cat("BCubed Precision is:", precision, "\n")
cat("BCubed Recall is:", recall, "\n")
```

From the plot and the results we conclude that k=2 is the best of the three choices. The two clusters do not overlap, so objects within a cluster are similar to each other and dissimilar to objects in the other cluster. The BCubed recall is relatively high (0.71) and is the highest among the chosen values of k, whereas the precision is low (0.28). With five salary classes and only two clusters, every cluster necessarily mixes several classes, which limits precision; outliers and sensitivity to the initial centroids may also contribute. The WSS is 7287.657; because WSS decreases as k increases, this value is mainly useful in comparison with the other values of k. Lastly, the average silhouette width is 0.34, the highest of the three, which reflects the best-separated clusters among the tested values of k.

##### k=3

```{r}

#Use seed to guarantee replicability of random processes
set.seed(8953)

# run k-means clustering to find 3 clusters
kmeans.result <- kmeans(dataset3, 3)

# print the clustering result
kmeans.result

# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)

#average silhouette for each clusters
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset3)) 
fviz_silhouette(avg_sil)

#Within-cluster sum of squares wss 
wss <- kmeans.result$tot.withinss
print(wss)

#BCubed 
kmeans_cluster <- c(kmeans.result$cluster)

ground_truth <- c(classLabel)

data <- data.frame(cluster = kmeans_cluster, label = ground_truth)

bcubed <- function(data) {
  n <- nrow(data)
  total_precision <- 0
  total_recall <- 0

  for (i in 1:n) {
    cluster <- data$cluster[i]
    label <- data$label[i]

    # Number of objects in the same category and cluster
    intersection <- sum(data$label[data$cluster == cluster] == label)

    # Number of objects that are in the same cluster
    total_same_cluster <- sum(data$cluster == cluster)

    # Number of objects that have the same category
    total_same_category <- sum(data$label == label)

    # Add this object's precision and recall to the running sums
    total_precision <- total_precision + intersection / total_same_cluster
    total_recall <- total_recall + intersection / total_same_category
  }

  # compute avg precision and recall
  precision <- total_precision / n
  recall <- total_recall / n

  return(list(precision = precision, recall = recall))
}

# compute BCubed precision and recall
metrics <- bcubed(data)


precision <- metrics$precision
recall <- metrics$recall

# Print results
cat("BCubed Precision is:", precision, "\n")
cat("BCubed Recall is:", recall, "\n")
```

We can conclude from the graph and the results for k=3 that the performance is good but worse than for k=2, because the clusters overlap. The recall is relatively low (0.45) and the precision is also low (0.31), which could be due to the presence of outliers or to sensitivity to the initial centroids. The WSS is 6451.51, indicating intermediate compactness: objects in a cluster are to some extent similar to one another. Lastly, the average silhouette width is 0.19, which reflects high inter-cluster similarity.

##### k=4

```{r}
#Use seed to guarantee replicability of random processes
set.seed(8953)

# run k-means clustering to find 4 clusters
kmeans.result <- kmeans(dataset3, 4)

# print the clustering result
kmeans.result

# visualize clustering
fviz_cluster(kmeans.result, data = dataset3)

# average silhouette width for each cluster
avg_sil <- silhouette(kmeans.result$cluster,dist(dataset3)) 
fviz_silhouette(avg_sil)

#Within-cluster sum of squares wss 
wss <- kmeans.result$tot.withinss
print(wss)

#BCubed
kmeans_cluster <- c(kmeans.result$cluster)

ground_truth <- c(classLabel)

data <- data.frame(cluster = kmeans_cluster, label = ground_truth)


bcubed <- function(data) {
  n <- nrow(data)
  total_precision <- 0
  total_recall <- 0

  for (i in 1:n) {
    cluster <- data$cluster[i]
    label <- data$label[i]

    # Number of objects in the same category and cluster
    intersection <- sum(data$label[data$cluster == cluster] == label)

    # Number of objects in the same cluster
    total_same_cluster <- sum(data$cluster == cluster)

    # Number of objects that have the same category
    total_same_category <- sum(data$label == label)

    # Calculate precision and recall for the current item and add them to the sums
    total_precision <- total_precision + intersection / total_same_cluster
    total_recall <- total_recall + intersection / total_same_category
  }

  # Compute avg precision and recall
  precision <- total_precision / n
  recall <- total_recall / n

  return(list(precision = precision, recall = recall))
}

# compute BCubed precision and recall
metrics <- bcubed(data)


precision <- metrics$precision
recall <- metrics$recall

# Print results
cat("BCubed Precision is:", precision, "\n")
cat("BCubed Recall is:", recall, "\n")
```

We can conclude from the graph and the results for k=4 that the performance is worse than for k=2 and k=3, because there is noticeable overlap between the clusters. The clusters are also spread widely, which results in large distances between objects in the same cluster. The recall is relatively low (0.43), which might be a result of the overlap and of the large distances between data objects, and the precision is low (0.29), which could be due to the presence of outliers or to sensitivity to the initial centroids. The WSS is 5911.05, the lowest of the three, but since WSS generally decreases as k increases this does not by itself indicate better clusters. Lastly, the average silhouette width is 0.22, which is low and reflects high inter-cluster similarity.
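The chunks for k=2, k=3 and k=4 repeat the same steps, so the comparison can also be collected in a single loop. The sketch below is only an illustration under stated assumptions: `toy` is synthetic data standing in for `dataset3`, `summary_k` is a hypothetical name, and `nstart = 10` is added to make the toy result stable rather than to mirror the chunks above.

```r
library(cluster)  # silhouette()

set.seed(8953)

# Synthetic stand-in for dataset3 (assumption): two separated Gaussian blobs
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 6), ncol = 2))
d <- dist(toy)

# One row per k: average silhouette width and total within-cluster SS
summary_k <- do.call(rbind, lapply(2:4, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 10)
  sil <- silhouette(fit$cluster, d)
  data.frame(k = k,
             avg_silhouette = mean(sil[, "sil_width"]),
             wss = fit$tot.withinss)
}))

print(summary_k)
```

Reading the two columns side by side makes it easier to see that a lower WSS at a larger k does not necessarily come with a better silhouette width.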




## Clustering results

This table displays the results of clustering using various methods for each K.

+-----------------------------------------+-------------------+----------------+-----------------------------+
|                                         | K=2               | K=3            | K=4                         |
+:=======================================:+:=================:+:==============:+:===========================:+
| **Average Silhouette width**            | 0.34              | 0.31           | 0.2                         |
+-----------------------------------------+-------------------+----------------+-----------------------------+
| **Total within-cluster sum of squares** | 7295.548          | 6778.046       | 5915.724                    |
+-----------------------------------------+-------------------+----------------+-----------------------------+
| **BCubed precision**                    | 0.2812713         | 0.2807635      | 0.2895803                   |
+-----------------------------------------+-------------------+----------------+-----------------------------+
| **BCubed recall**                       | 0.7064208         | 0.6786707      | 0.401187                    |
+-----------------------------------------+-------------------+----------------+-----------------------------+
| **Visualization**                       | ![](images/k2-01) | ![](images/k3) | ![](images/k4){width="145"} |
+-----------------------------------------+-------------------+----------------+-----------------------------+



## Phase 4




## Findings

While working on this project, we prepared the data so that data mining techniques, namely classification and clustering, could be applied to it. As discussed in the previous sections, we built decision trees using three attribute selection measures (Gini index, information gain and gain ratio), each evaluated with three cross-validation fold sizes (3, 5 and 10 folds). The table below displays the results for all of them.

```{r}
# Metrics returned by macro() for every splitting criterion and fold size;
# the blank rows act as section headers in the printed matrix
print(
  rbind("10 Folds" = c(" ", " ", "", ""),
        "gini index" = macro(giniIndex10cm),
        "Gain ratio" = macro(gainRatio10cm),
        "Information gain" = macro(infoGain10cm),
        "5 Folds" = c(" ", " ", "", ""),
        "gini index" = macro(giniIndex5cm),
        "Gain ratio" = macro(gainRatio5cm),
        "Information gain" = macro(infoGain5cm),
        "3 Folds" = c(" ", " ", "", ""),
        "gini index" = macro(giniIndex3cm),
        "Gain ratio" = macro(gainRatio3cm),
        "Information gain" = macro(infoGain3cm))
)
```

Based on these metrics, we can further assess the performance of each method as follows:

-   Accuracy: the 10-fold gain ratio model has the highest accuracy, 63%, meaning that it correctly classified 63% of the instances. The 5-fold gain ratio model comes next with 62% accuracy. The worst is the 3-fold Gini index model with only 56% accuracy.

-   Precision: the highest precision, also 63%, was achieved by the 10-fold gain ratio model; this indicates that, of all the instances the model classified as positive, 63% were correct. Right after it comes the 5-fold gain ratio model with a precision of 61.9%. The worst performance on precision is the 3-fold Gini index model with only 55%.

-   Specificity: many models achieve high values here, the highest being the 10-fold Gini index model with 90.9% specificity, which means that, among the instances that are negative (classes other than the positive one), it correctly classified about 90% of them. Other models are close behind: the 5-fold gain ratio with 90.6%, the 3-fold gain ratio with 90.5%, and the 10-fold and 5-fold information gain models with 90.2%. The worst is the 3-fold Gini index model with 89%, which is still considered high.

-   Sensitivity: the highest sensitivity was achieved by the 10-fold gain ratio model with 63%, meaning that, among the positives, the model correctly classified 63% of them. Right after it comes the 5-fold gain ratio model with 61%. The worst is the 3-fold Gini index model with 54% sensitivity.
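To make the four metrics concrete, the sketch below computes macro-averaged values from a multi-class confusion matrix. It is a minimal illustration and not the report's own `macro()` function: the name `macro_metrics`, the row/column orientation (rows are predictions, columns are actual classes) and the 3-class matrix are all assumptions.

```r
# Hypothetical helper: macro-averaged metrics from a square confusion matrix
# whose rows are predicted classes and whose columns are actual classes.
macro_metrics <- function(cm) {
  cm <- as.matrix(cm)
  stopifnot(nrow(cm) == ncol(cm))

  total <- sum(cm)
  tp <- diag(cm)
  fp <- rowSums(cm) - tp      # predicted as the class, actually another one
  fn <- colSums(cm) - tp      # actually the class, predicted as another one
  tn <- total - tp - fp - fn  # everything that does not involve the class

  c(accuracy    = sum(tp) / total,
    precision   = mean(tp / (tp + fp)),
    sensitivity = mean(tp / (tp + fn)),
    specificity = mean(tn / (tn + fp)))
}

# Made-up confusion matrix with three salary classes
cm <- matrix(c(8, 1, 1,
               2, 7, 1,
               0, 2, 8),
             nrow = 3, byrow = TRUE,
             dimnames = list(predicted = c("low", "mid", "high"),
                             actual = c("low", "mid", "high")))

print(round(macro_metrics(cm), 3))
```

In this toy matrix, accuracy, precision and sensitivity are all about 0.77 while specificity is about 0.88. With several classes, most instances are negatives for any single class, which is one reason specificity tends to be the highest of the four metrics.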


If we look at each fold size separately, the gain ratio gives the best accuracy, precision and sensitivity for 10, 5 and 3 folds alike; only on specificity does the 10-fold Gini index come out slightly ahead. The 3-fold Gini index was the worst in all aspects, although its gap to the other models is small for most of the metrics.

So, if we were to choose one model, it would be the 10-fold gain ratio: the gain ratio evaluated with 10-fold cross-validation appears to have the best overall performance among all the decision tree models. This might be because the gain ratio tends to favor unbalanced splits, in which one partition is much smaller than the others. In our dataset, if an attribute has a rare value, the gain ratio may prioritize splitting on this attribute despite the resulting imbalance.
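The tendency toward unbalanced splits can be seen in a small worked example. The sketch below computes information gain, split information and gain ratio for two made-up attributes; the helper names (`entropy`, `gain_ratio`) and the data are illustrative assumptions and are not taken from the project's models.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(y) {
  p <- as.numeric(table(y)) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain, split information and gain ratio of attribute x for class y
gain_ratio <- function(x, y) {
  groups <- split(y, as.character(x))   # class labels per attribute value
  w <- lengths(groups) / length(y)      # share of instances per value
  info_gain <- entropy(y) - sum(w * vapply(groups, entropy, numeric(1)))
  split_info <- -sum(w * log2(w))
  c(info_gain = info_gain,
    split_info = split_info,
    gain_ratio = info_gain / split_info)
}

# Made-up data: 8 instances, 4 "high" and 4 "low"
y        <- c("high", "high", "high", "low", "high", "low", "low", "low")
balanced <- c("a", "a", "a", "a", "b", "b", "b", "b")  # 4 / 4 split
rare     <- c("r", rep("c", 7))                        # 1 / 7 split

print(round(rbind(balanced = gain_ratio(balanced, y),
                  rare     = gain_ratio(rare, y)), 3))
```

Here the balanced attribute has the larger information gain, but the attribute with the rare value has the larger gain ratio, because its split information (the denominator) is much smaller.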

Despite the edge of 10-fold cross-validation, the performance metrics for all three fold sizes (3, 5 and 10 folds) using the Gini index, gain ratio and information gain are relatively similar, with the gain ratio giving the best performance at each fold size. This suggests that all three measures are robust within the context of this dataset. A likely contributing factor to this consistency is the balanced distribution of class labels in the dataset. When classes are balanced, each splitting criterion is more or less equally likely to encounter informative splits, which helps to decrease performance variability across different splitting methods.

It can also be noticed that all nine trees selected the attribute "Experience Level" as the first splitting attribute, indicating that it reduces uncertainty more than any other predictor and is the most informative attribute in our case.





So, the model we have chosen as our classification model is the gain ratio tree with 10 folds:

```{r , fig.height=70, fig.width=90}
plot(gainRatio10$finalModel)
```




As for clustering with the k-means method, the analysis showed that k=2 performed best, standing out for its distinct, non-overlapping clusters. The data within each cluster are highly similar to one another while being notably dissimilar from the data in the other cluster, which supports k=2 as the optimal choice for this clustering scenario based on evaluation measures such as BCubed precision and recall and average silhouette width, and on the cluster plots.

This table displays the results of clustering using various methods for each K.

+-----------------------------------------+-------------------+----------------+-----------------------------+
|                                         | K=2               | K=3            | K=4                         |
+:=======================================:+:=================:+:==============:+:===========================:+
| **Average Silhouette width**            | 0.34              | 0.31           | 0.2                         |
+-----------------------------------------+-------------------+----------------+-----------------------------+
| **Total within-cluster sum of squares** | 7295.548          | 6778.046       | 5915.724                    |
+-----------------------------------------+-------------------+----------------+-----------------------------+
| **BCubed precision**                    | 0.2812713         | 0.2807635      | 0.2895803                   |
+-----------------------------------------+-------------------+----------------+-----------------------------+
| **BCubed recall**                       | 0.7064208         | 0.6786707      | 0.401187                    |
+-----------------------------------------+-------------------+----------------+-----------------------------+
| **Visualization**                       | ![](images/k2-01) | ![](images/k3) | ![](images/k4){width="145"} |
+-----------------------------------------+-------------------+----------------+-----------------------------+

Based on these metrics, we can assess the performance of each K value:

-   Average Silhouette Width: The highest value is achieved at K = 2 (0.34), indicating that the clusters are better separated than for the other K values. The other values (0.31 and 0.2 for K = 3 and K = 4 respectively) indicate an acceptable but weaker separation between the clusters.

-   Total Within-Cluster Sum of Squares: The lowest value is observed at K = 4 (5915.724), indicating tighter clusters than at K = 2 and K = 3. The other values (7295.548 and 6778.046 for K = 2 and K = 3 respectively) show moderate compactness, which could be due to the presence of outliers or to sensitivity to certain values. Because WSS generally decreases as K grows, it has to be read together with the other measures.

-   BCubed Precision and Recall: Recall is highest at K = 2 (0.71), suggesting a better match between the cluster assignments and the ground truth. Recall is still acceptable at K = 3 and low at K = 4, which can be due to overlapping and sparse data within the clusters. The precision values are very close for all three K values (about 0.28 to 0.29), with K = 4 marginally highest.

-   Visualization: At K = 3 and K = 4 the clusters overlap, which mostly indicates acceptable to low performance, unlike K = 2, which yields two non-overlapping clusters, indicating a better grouping of objects.

Considering these metrics, we can conclude that K = 2 is the optimal k: it performs best in terms of average silhouette width and BCubed recall, while BCubed precision is about the same for all three values. Clustering with K = 2 yields well-separated clusters that do not overlap, which means that the objects in a cluster are close (similar) to one another and dissimilar to the objects in the other cluster, reflecting high intra-cluster similarity. The overlap and the lower average silhouette width and recall at K = 3 and K = 4 could be due to the presence of outliers, sensitivity to the initial centroids, or high inter-cluster similarity.
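Sensitivity to the initial centroids can be checked, and usually reduced, with the `nstart` argument of `kmeans()`, which runs several random initialisations and keeps the one with the lowest total within-cluster sum of squares. The sketch below is an illustration on synthetic data: `toy` is an assumed stand-in for `dataset3`, and the seeds are arbitrary.

```r
set.seed(8953)

# Synthetic stand-in for dataset3 (assumption): three Gaussian blobs in 2-D
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2),
             matrix(rnorm(60, mean = 8), ncol = 2))

set.seed(1)
single <- kmeans(toy, centers = 3)              # one random initialisation

set.seed(1)
multi <- kmeans(toy, centers = 3, nstart = 25)  # best of 25 initialisations

# Total within-cluster sum of squares for both runs
print(c(single = single$tot.withinss, multi = multi$tot.withinss))
```

If the two values differ noticeably on the real data, the single-start result was probably a poor local optimum, and the metrics for that k would be worth recomputing with a larger `nstart`.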


The results obtained from both classification and clustering are interesting and strongly related to the core of the problem we are working on, and they feed directly into the solution.

Ultimately, both models prove valuable in predicting cybersecurity salaries and grouping employees based on shared characteristics. The choice between classification and clustering hinges on the specific objective. Classification is advantageous for predicting salary ranges or categories based on known employee attributes, aiding individual compensation decisions; when it is essential to predict salary categories, using the gain ratio in classification proves beneficial. Clustering helps identify natural employee groupings, revealing trends and common characteristics, facilitating market segmentation, and informing broader strategic decisions; for uncovering natural groupings in the salary data, K=2 clustering stands out for its distinct clusters.
 
In conclusion, both techniques are important and suitable for our dataset, due to their ability to achieve our data mining tasks (predicting employees' salaries and grouping employees based on similarities). Thus both are critical to solving the problem introduced in this project, and both will help achieve our goals, such as market segmentation, fairness among employees and increased employee loyalty.

Consequently, our solution is composed of two main parts that will help achieve our goals:

1. Use classification to predict employees' salaries (using the gain ratio method with 10-fold cross-validation).
2. Use clustering to group employees based on their similarities (using k-means with 2 clusters).

Addressing these issues lets us tackle problems such as unfairness, losing candidates to other companies that offer better privileges, and poor understanding of employees' needs.
Finally, by solving these problems, we can ensure that cybersecurity employees are satisfied with the salary they get, leading to better performance at their jobs and better protection of the organization's data and valuable digital assets.



## References

[1] J. Han, M. Kamber, and J. Pei, "Data Mining: Concepts and Techniques," 3rd ed., The Morgan Kaufmann Series in Data Management Systems.

[2] Y. Zhao, "R and Data Mining: Examples and Case Studies," 1st ed. Academic Press, 2012. ISBN: 0123969638.

[3] Y. Zhao, "R and Data Mining," RDataMining.com, Available: <https://www.rdatamining.com/>, Accessed on: November 23, 2023.

[4] M. Hahsler, "discretize {arules} R Documentation: Convert a Continuous Variable into a Categorical Variable," R Project, Available: <https://search.r-project.org/CRAN/refmans/arules/html/discretize.html>, Accessed on: November 23, 2023.
